RFC: IR metadata format for MemProf

This RFC describes the IR metadata format for the sanitizer-based heap profiler (MemProf) data when it is fed back into a subsequent compile for profile guided heap optimization. Additional background can be found in the following RFCs:

  • RFC: Sanitizer-based Heap Profiler [1]

  • RFC: A binary serialization format for MemProf [2]

We look forward to your feedback.

Authors: tejohnson@google.com, snehasishk@google.com, davidxl@google.com

Requirements

The profile format will need to be reasonably compact within IR, but also facilitate efficiently identifying callsites within the stack contexts for use in interprocedural (and cross module) optimization for context disambiguation. These optimizations will require transformations to disambiguate the context at a particular allocation, which will require call graph based analysis and optimization.

Input Profile Format

The profile will be in an indexed binary format, which is detailed in [2]. In the indexed format, the profiles for each allocation site will be located with the profile data for the function containing the allocation call site. Each allocation may have multiple profile entries (each known as a MIB, or Memory Info Block), uniquely identified by stack context. The entries in the stack context will be symbolized and include file, line and discriminator.

Metadata Format

Similar to branch weights and value profile data from regular PGO, the PGHO profile will be annotated as metadata onto relevant instructions. A natural instruction to attach the profile metadata is on the allocation callsite, so these allocation calls can be identified and handled by the subsequent heap optimization pass. As an example, this profile data can be used to enable automatic application of hot and cold hints to allocations, for use by a runtime allocator such as tcmalloc, where support for such allocation hints was recently added [3].

However, in order to identify ancestor callsites within an allocation’s call stack context that require modification for disambiguating the context at the allocation site, e.g. via cloning, we will also want to attach metadata to these callsites. This is particularly important for contexts that cross module boundaries, so that we can identify them in ThinLTO summaries for cross module coordination of context transformations.

To identify and correlate entries in a context, we will use a unique identifier for each stack entry. Specifically, we will use the 64-bit value from the stack entry table in the indexed profile format which is formed from the index into the file path table along with the line and discriminator. Another option would be to represent the stack entries using existing debug metadata. However, for stack entries in another module we would need to synthesize additional debug location metadata in the module containing MIB profile data that references that stack context entry.

Assume the following working example. For simplicity, all are shown as being in the same module, however, these function definitions could theoretically be located in multiple different modules.

x.cc

1 main() {

2 foo(); // stack entry id: 123

3 }

4

5 foo() {

6 baz(); // stack entry id: 234

7 }

8

9 baz() {

10 if (x)

11 bar(); // stack entry id: 345

12 else

13 bar(); // stack entry id: 456

14 }

15

16 bar() {

17 malloc(4); // stack entry id: 567

18 }

The call to malloc has 2 possible calling contexts:

  1. main → foo (x.cc:2) → baz (x.cc:6) → bar (x.cc:11) → malloc (x.cc:17)

  2. main → foo (x.cc:2) → baz (x.cc:6) → bar (x.cc:13) → malloc (x.cc:17)

where the stack entry id for each callsite, taken from the profile’s stack entry table contents, is shown in the code comments. The corresponding full contexts in terms of stack entry ids, listed from the leaf allocation callsite up to the root are:

  1. 567, 345, 234, 123

  2. 567, 456, 234, 123

Assuming both contexts execute at runtime, the allocation will end up with 2 MIBs in the profile, one for each of the above contexts.

To represent this in the IR, we propose 2 new metadata attachment types, as described below.

Callsite metadata

The !callsite metadata is used to associate a callsite with its corresponding references in MIB stack contexts. It contains the associated 64-bit stack entry table value for that callsite from the indexed profile, and is initially only on non-allocation callsites. As will be described later, after inlining it can contain multiple entry ids or be propagated onto allocation callsites.

In the above example, for the call to foo(), which had stack entry id 123, the IR callsite would be decorated with a !callsite metadata containing that stack entry id:

tail call void @_Z3foov(), !dbg !12, !callsite !14

!14 = !{i64 123}

Note that this call may be in a different module initially than the referencing MIB metadata. In order to disambiguate the context across modules, some form of LTO would be required. ThinLTO summary support will be added to reflect the cross-module contexts and enable cross module optimization of the contexts.

Also, while for MemProf the ids will be assigned uniquely using information from the MemProf profile, other types of context sensitive profiles could simply reuse the same id after matching with line table information, or at least leverage the same metadata attachment to assign their own unique ids if there is no MemProf profile.

Memprof metadata

The !memprof metadata describes the MIBs for the leaf allocation callsite it is attached to. If there are multiple stack contexts leading to that allocation, it will have a single !memprof metadata attachment, with a level of indirection used to list all related MIBs, as shown in the later example.

As with the indexed profile format, we need to be able to add or modify fields of the MIB entries while maintaining backwards compatibility with older bitcode. Therefore, we use a schema format with the MIB profile entry fields described by a “Memprof Schema” module level metadata, for example:

!llvm.module.flags = !{!1}

!1 = !{i32 1, !"Memprof Schema", !"Stack", !"AllocCount", !"AveSize", !"MinSize", !"MaxSize", !"AveAccessCount", !"MinAccessCount", !"MaxAccessCount", !"AveLifetime", !"MinLifetime", !"MaxLifetime", !"NumMigration", !"NumLifetimeOverlaps"}

The first (merge behavior) field is 1 (ModFlagBehavior::Error), meaning that it is an error to merge modules with different values, or in other words, to merge modules compiled with profiles generated by different versions of the indexed profile format.

Assume we are using the schema shown in the above module flag metadata. In the earlier example, where allocation was reached by 2 different profile contexts (and therefore had 2 MIBs), the !memprof metadata would be structured similar to the following:

%call = tail call noalias align 16 dereferenceable_or_null(4) i8* @malloc(i64 4) #4, !dbg !261, !memprof !13

!13 = !{!14, !16}

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32 20, i32 40, i32 0, i32 0}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32 10, i32 10, i32 0, i32 0}

The first operand of each MIB metadata entry (the “Stack” field) references another metadata node containing the call stack context, whose entries are, as described earlier, unique 64-bit values used to identify stack entries in the indexed profile. These stack entry ids enable correlation with the !callsite metadata attached to calls, particularly across modules. Note that we don’t need to include a stack entry id for the allocation callsite, since the metadata is already correlated with it by attachment. Therefore, the contexts for the two MIBs shown here would contain the list of ids described for the example, minus 567 for the allocation callsite: “345, 234, 123” and “456, 234, 123”.
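For reference, the remaining operands of the first MIB node (!14) map positionally onto the schema fields above; this is simply a restatement of the example values, not additional data:

; !14: Stack = !15, AllocCount = 2, AveSize = 4, MinSize = 4, MaxSize = 4,
;      AveAccessCount = 10, MinAccessCount = 5, MaxAccessCount = 15,
;      AveLifetime = 30, MinLifetime = 20, MaxLifetime = 40,
;      NumMigration = 0, NumLifetimeOverlaps = 0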

This call stack metadata (e.g. !15 and !17 in the above example) should identify the stack entries in order from the leaf allocation’s caller to the root caller. There are several options for organizing this metadata. One simple possibility is to list all stack entry ids exhaustively in each call stack. For example:

!15 = !{i64 456, i64 234, i64 123}

!17 = !{i64 345, i64 234, i64 123}

However, as can be seen in the above example, this results in a lot of duplication. Instead, we will organize the call stack metadata as a chain of metadata nodes, one stack entry id per node, with a second operand pointing to the next (caller) stack entry in the chain. This allows for deduplication in the above case, which then looks like:

!15 = !{i64 456, !18}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

While in this toy example the chaining results in overall more metadata nodes and operands, in reality the contexts are much longer, and as we will show later, the opportunity to deduplicate stack entries is large in practice.

We may not be able to completely deduplicate all instances of a particular stack entry id. If we additionally had a call stack with entries 567, 234, 678, 123, although 234 is the same id shown in the earlier call stacks, it has a different caller (678) so we cannot share the same metadata node used for 234 in the above metadata. We will instead need to essentially duplicate it, like:

!20 = !{i64 567, !21}

!21 = !{i64 234, !22}

!22 = !{i64 678, !19}

The last entry above can and does point to the same (root stack entry) metadata !19 used for entry 123 by the earlier call stacks.
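As an illustration only, the following C++ sketch (not the proposed implementation) shows one way the chained call stack metadata could be built, walking each context from the root caller downward and memoizing on the (stack entry id, caller node) pair so that shared suffixes such as !18 and !19 above are reused:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Type.h"
#include <cstdint>
#include <utility>
using namespace llvm;

// StackIds is ordered from the allocation's immediate caller to the root,
// e.g. {456, 234, 123}. Returns the node for the immediate caller (e.g. !15).
MDNode *buildCallStackMD(LLVMContext &Ctx, ArrayRef<uint64_t> StackIds,
                         DenseMap<std::pair<uint64_t, MDNode *>, MDNode *> &Cache) {
  MDNode *Caller = nullptr;
  // Build from the root caller downward so that shared suffixes hit the cache.
  for (uint64_t Id : reverse(StackIds)) {
    MDNode *&Node = Cache[{Id, Caller}];
    if (!Node) {
      SmallVector<Metadata *, 2> Ops;
      Ops.push_back(ConstantAsMetadata::get(
          ConstantInt::get(Type::getInt64Ty(Ctx), Id)));
      if (Caller)
        Ops.push_back(Caller); // second operand points at the caller's entry
      Node = MDNode::get(Ctx, Ops);
    }
    Caller = Node;
  }
  return Caller;
}

Because each context is built from the root, two contexts that differ only near the leaf, as in the example above, share all of their common caller nodes.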

A measurement of the effectiveness of this deduplication strategy for 3 large internal applications showed we could reduce the number of stack entry ids by 75-88%. This was measured across the entire profile, so the deduplication effectiveness will be smaller within each IR module, where there is only a subset of call stacks with which to deduplicate. Additionally, the chaining requires extra Metadata nodes and operands for the caller Metadata pointers. Accounting for these effects leads to estimated size overhead reductions of 12-60%, a smaller lower bound but a wider range. However, the measurements suggest this strategy will lead to a net reduction in overhead.

Inlining

When calls are inlined after annotation, the relevant !callsite metadata can be merged, with the ordering of the entries implying the inlined callee->caller relationships, and, if relevant, added onto the inlined allocation callsite (which will then have both !callsite and !memprof metadata).

For example, assuming we have the following IR snippets for the original example (for simplicity, all in the same Module):

define dso_local i32 @main() local_unnamed_addr #0 !dbg !80 { …

tail call void @_Z3foov(), !dbg !121, !callsite !1

…

define dso_local void @_Z3foov() local_unnamed_addr #0 !dbg !81 { …

tail call void @_Z3bazv(), !dbg !111, !callsite !2

…

define dso_local void @_Z3bazv() local_unnamed_addr #0 !dbg !82 { …

; if (x)

tail call void @_Z3barv(), !dbg !11, !callsite !3

…

; else

tail call void @_Z3barv(), !dbg !12, !callsite !4

…

define dso_local void @_Z3barv() local_unnamed_addr #0 !dbg !258 { …

%call = tail call noalias align 16 dereferenceable_or_null(4) i8* @malloc(i64 4) #4, !dbg !261, !memprof !13

…

!1 = !{i64 123}

!2 = !{i64 234}

!3 = !{i64 345}

!4 = !{i64 456}

!13 = !{!14, !16}

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32 20, i32 40, i32 0, i32 0}

!15 = !{i64 456, !18}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32 10, i32 10, i32 0, i32 0}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

If we inlined bar() into both of its callsites in baz(), and then baz() into foo() and foo() from there into main() - i.e. all the way up to the root of the call stack - the resulting allocation calls eventually inlined into main() would look like:

define dso_local i32 @main() local_unnamed_addr #0 !dbg !8 { …

; if (x)

%call = tail call noalias align 16 dereferenceable_or_null(4) i8* @malloc(i64 4) #4, !dbg !26, !memprof !16, !callsite !17

…

; else

%call = tail call noalias align 16 dereferenceable_or_null(4) i8* @malloc(i64 4) #4, !dbg !27, !memprof !14, !callsite !15

…

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32 20, i32 40, i32 0, i32 0}

!15 = !{i64 456, !18}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32 10, i32 10, i32 0, i32 0}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

In the above case, since we inlined all the way up to the root main, the merged !callsite metadata on each inlined allocation callsite is the same as the list in the “Stack” metadata of the associated !memprof MIB, so we can reuse those nodes (!17 and !15) as the inlined allocations’ !callsite metadata.

Note that the original allocation had multiple MIBs in its memprof metadata, corresponding to different stack contexts. When we inline, we should prune those MIBs on the inlined allocation call whose stack contexts do not start with the stack entry sequence described in the concatenated !callsite metadata. In the above example, the inlined callsites have contexts implied by their new !callsite metadata that match exactly one of the original MIBs, and we prune the other. If we had not inlined all the way up into the root main, we might have multiple MIB entries whose stack contexts include the inlined stack entry sequence in the !callsite metadata, and they should all be kept on the inlined allocation callsite.

As another example, if we only inlined bar() into baz(), the inlined allocations in baz() would look like:

define dso_local void @_Z3bazv() local_unnamed_addr #0 !dbg !82 { …

; if (x)

%call = tail call noalias align 16 dereferenceable_or_null(4) i8* @malloc(i64 4) #4, !dbg !26, !memprof !16, !callsite !3

…

; else

%call = tail call noalias align 16 dereferenceable_or_null(4) i8* @malloc(i64 4) #4, !dbg !27, !memprof !14, !callsite !4

…

!3 = !{i64 345}

!4 = !{i64 456}

!13 = !{!14, !16}

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32 20, i32 40, i32 0, i32 0}

!15 = !{i64 456, !18}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32 10, i32 10, i32 0, i32 0}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

In this case the !callsite metadata on the inlined allocations is the same as that on the original calls to bar(), i.e. a single stack entry id each, since we haven’t inlined across multiple calls with !callsite metadata. But we can still prune any MIBs whose stack contexts don’t start with the id in the !callsite metadata on the inlined call.

And alternatively, if we only inlined foo() into main(), the call to baz() now residing in main() would have !callsite metadata with a pointer to the !callsite metadata of the now inlined call to foo() (!1):

define dso_local i32 @main() local_unnamed_addr #0 !dbg !8 {

…

tail call void @_Z3bazv(), !dbg !11, !callsite !2

…

}

!1 = !{i64 123}

!2 = !{i64 234, !1}

Keeping the !callsite metadata from the inlined callsites and concatenating them in an upward call stack order will be important when we later determine what type of cloning or other context disambiguating optimizations are required.

[1] RFC: Sanitizer-based Heap Profiler (https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html)

[2] RFC: A binary serialization format for MemProf (https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html)

[3] Implement interfaces for providing access frequency hints to TCMalloc (https://github.com/google/tcmalloc/commit/ab87cf382dc56784f783f3aaa43d6d0465d5f385)

One change below along with another clarification; the rest of the RFC text is unchanged. Also, you can view the original RFC here with better formatting for the examples: https://groups.google.com/g/llvm-dev/c/aWHsdMxKAfE/m/WtEmRqyhAgAJ


To clarify, the callsite metadata value can be any globally unique identifier. While in this proposal we simply describe using the indexed profile’s associated 64-bit stack entry table value, an alternative could be to compute this from the MD5 hash of the debug information (file:line:discriminator). It doesn’t affect the format and its usage described in this RFC.
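A minimal sketch of that MD5-based alternative, assuming the id is taken as the 64-bit MD5 hash of the file:line:discriminator string; the exact scheme is an implementation choice and is not fixed by this RFC:

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MD5.h"
#include <cstdint>
#include <string>

// Hypothetical helper: derive a globally unique stack entry id from the
// symbolized debug location rather than from the profile's stack entry table.
uint64_t stackEntryIdFromDebugLoc(llvm::StringRef File, unsigned Line,
                                  unsigned Discriminator) {
  std::string Key = File.str() + ":" + std::to_string(Line) + ":" +
                    std::to_string(Discriminator);
  return llvm::MD5Hash(Key);
}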


Using module flags for the schema doesn’t work, because a module flag only supports a single string tag and an integer flag value, not arbitrary contents. Instead, I have implemented this using a new named metadata:

!memprof.schema = !{!0}
!0 = !{!"Stack", !"AllocCount", !"AveSize", !"MinSize", !"MaxSize", !"AveAccessCount", !"MinAccessCount", !"MaxAccessCount", !"AveLifetime", !"MinLifetime", !"MaxLifetime", !"NumMigration", !"NumLifetimeOverlaps"}

Named metadata must only hold metadata nodes as operands. Here we use a single operand to point to metadata that describes the schema.

The advantage of using a single metadata operand in the new !memprof.schema metadata, vs for example a list of metadata operands each pointing to a single MDString schema field, is that it simplifies detection of different schemas when modules are merged for LTO. If the schemas are identical, the merged !memprof.schema metadata will continue to hold a single operand, since the node holding the schema (!0 above) will be shared. If they are not identical, the merged module’s !memprof.schema metadata will hold more than one metadata operand, one for each unique schema.
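For illustration, here is a minimal sketch (a hypothetical helper, not part of the proposal) of how a pass could check whether an LTO-merged module ended up with a single schema under this named-metadata scheme:

#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

// Identical schema nodes are uniqued in the LLVMContext, so comparing the
// operands by pointer is enough to detect differing schemas after a merge.
bool hasSingleMemprofSchema(llvm::Module &M) {
  llvm::NamedMDNode *Schema = M.getNamedMetadata("memprof.schema");
  if (!Schema || Schema->getNumOperands() == 0)
    return false;
  llvm::MDNode *First = Schema->getOperand(0);
  for (llvm::MDNode *Op : Schema->operands())
    if (Op != First)
      return false;
  return true;
}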

An alternative, which is what I originally had in a prototype, was to include the MDString field tags in each !memprof metadata, for example:

!274 = !{!"Stack", !273, !"AllocCount", i32 1, !"AveSize", i32 4, !"MinSize", i32 4, !"MaxSize", i32 4, !"AveAccessCount", i64 5, !"MinAccessCount", i64 1, !"MaxAccessCount", i64 10, !"AveLifetime", i32 10, !"MinLifetime", i32 10, !"MaxLifetime", i32 10, !"NumMigration", i32 0, !"NumLifetimeOverlaps", i32 0}

This provides maximal flexibility of merging modules using different schemas, at the expense of additional overhead in operands (doubling the number of operands of !memprof metadata). But if we need to support merging modules with different schemas, we could alternatively just support unifying the named metadata schemas and fixing up the associated !memprof metadata during the merging process, rather than carrying this extra overhead.

Hello Teresa, Snehasish and David,

Thanks for the RFC and follow-up clarification. I have a few questions regarding the use of the metadata and how they are manipulated.


Hi Hongtao, responses below.

  1. While the proposed !callsite metadata and its compression through chaining sound efficient, have you considered using the existing !dbg metadata as an alternative?

We could potentially do so for the !callsite metadata, but not for the stack ids in the !memprof metadata’s stack context (which could be in another module), and we need to be able to correlate these (i.e. identify the callsites corresponding to the stack context list on the !memprof metadata), using ThinLTO if they are across module boundaries. This idea and its downside are mentioned briefly near the beginning of the Metadata Format section:

Another option would be to represent the stack entries using existing debug metadata. However, for stack entries in another module we would need to synthesize additional debug location metadata in the module containing MIB profile data that references that stack context entry.

It is possible that if we used stack ids that were generated from the MD5 hash of the debug location, we could actually generate this id for callsites on the fly (from its debug metadata). However, having the !callsite metadata shows explicitly which callsites we need to consider for the MemProf optimizations (and also which to summarize in the ThinLTO summary in order to perform cross-module context disambiguation).

  2. Does the !callsite metadata need to be maintained on MIR?

We don’t plan to as the targeted transformations would be in LLVM IR (e.g. context disambiguation via cloning or the like, modification of allocation calls).

  3. For the !callsite metadata, in order to make sure the metadata is not accidentally dropped, how much extra care is needed in passes such as tail call optimization?

I believe tail call elimination happens in codegen, so it is after the point where we would be done with this metadata. Tail recursion elimination is earlier, but we probably can and will want to handle direct recursion in the contexts specially anyway.

  4. When two callsites are merged by some CFG optimizations, how are their !callsite metadata handled?

Initially we can prevent this type of merging until after context disambiguation is complete, i.e. while they have !callsite metadata (in fact, we might need to keep the calls separate anyway if their respective contexts result in different behavior at leaf allocation sites). If this becomes too limiting we could probably extend the !callsite metadata to include “aliased” callsite ids from merged callsites.

  5. How do we identify a call path in the presence of indirect call sites?

If the indirect callsites aren’t already speculatively devirtualized via the value profile info, we can do so in the process of cloning for context disambiguation, because we know the caller from the full stack context.

  6. I was wondering how the !memprof metadata is consumed. Will they be passed into the runtime allocator in some way?

We plan to use it to set hints on allocations that are consumed by the allocator. See for example the recent patch to TCMalloc that adds hot/cold hints:

[3] Implement interfaces for providing access frequency hints to TCMalloc (https://github.com/google/tcmalloc/commit/ab87cf382dc56784f783f3aaa43d6d0465d5f385)

Thanks!
Teresa

Hi Teresa and others,

Could you, please, give an update on this project (MemProf)?

I see a steady stream of commits from Teresa and Snehasish, so I understand the project is actively progressing; however, without reading all the code it’s hard to understand the current status.

Thus, a quick update would be much appreciated.

Specifically:

  • What’s the current status?
  • When do you expect to have an “MVP” version? (To clarify, I’m not asking for a commitment – it would be totally inappropriate to do so in an open-source community; just your “gut feeling” based on current progress)
  • Do you need any help?

Yours,
Andrey

Hi Andrey,

Thanks for the email! Answers below:

  • What’s the current status?

The commits you have been seeing (mostly from Snehasish) are the implementation of the binary profile format, both raw and indexed, along with support in the runtime to dump the raw format, and support in llvm-profdata to create the indexed format, including merging with a regular PGO profile. While this may continue to evolve a bit with future refinements as needed, we now basically have the support we need to feed the memprof profiles back into the compiler.

On my side, I have been working on the profile-use implementation. I had implemented the bulk of it a while back, but didn’t want to send patches until I could do end-to-end testing with larger benchmarks, which the binary profile format has facilitated. I’m trying to wrap up the debugging and refinement of a first version so that I can start sending patches upstream.

The pieces I have been testing are roughly:

  • memprof profile matching
  • metadata format
  • ThinLTO summary format definition and generation
  • Cloning of callsites/functions for context disambiguation
  • Changing “cold” context allocation callsites to provide a hint (using interfaces defined in the open source tcmalloc).

The most complex piece is the cloning, which attempts to find the minimal number of clones required for any given callsite and function. I have also implemented the cloning support so that it operates on either IR or ThinLTO summary (the latter having companion ThinLTO import side support for updating the IR).

  • When do you expect to have an “MVP” version? (To clarify, I’m not asking for a commitment – it would be totally inappropriate to do so in an open-source community; just your “gut feeling” based on current progress)

Some of this is answered above. More concretely, I have built a large c++ spec app (omnetpp) using the changes, and am testing with large internal applications. There are some cases, however, such as indirect calls, that the cloning is not currently handling and that need a bit of additional support which I am sketching out.

My hope and plan is to start sending patches, at least for the matching and metadata, in the coming few weeks. I have a few improvements I would like to add to the ThinLTO summary support before I send that, and likely the cloning will need a bit more debugging and definitely some cleanup, but I would like to send all of these upstream this quarter. The changes to rewrite the allocation calls are straightforward, but of course won’t work when linking with anything other than tcmalloc right now, so likely I will want to send some patches to llvm’s libc++ operator new to add support for the interfaces as well.

  • Do you need any help?

Thanks for asking! Probably the best help would be reviewing patches once I get them sent upstream, testing, and perhaps implementing follow on improvements and fixes.

Teresa

Hi Teresa,

Thank you so much for the update!

We (Huawei) are very interested in this direction (efficient data cache utilization) and are actually currently exploring an alternative approach internally (it’s too early to share any details before we become more confident that our approach works; when (and if) that happens, we’ll be happy to share more with the community).

In the meantime we would be glad to test your developments on our internal applications and hopefully also help with code reviews.

Yours,
Andrey

Hi @teresajohnson
I see that in the tcmalloc patch, some extended operator new functions which take an additional tcmalloc::hot_cold_t hot_cold argument are added.
Does this mean memprof will have an LLVM pass which changes the normal operator new callsites to the tcmalloc::hot_cold_t hot_cold version according to the !memprof metadata?
If so, what about applications not using tcmalloc?
Thanks!


My prototyped changes, which I am preparing to send for review, add the functionality to simplifyLibCalls to transform operator new allocations with a cold attribute to the hot/cold hinting operator new interfaces when they are available via TLI.

Note I have changed the upstream tcmalloc implementation so that the hot_cold_t parameter is in the global namespace so it no longer contains “tcmalloc” in the parameter name (https://github.com/google/tcmalloc/commit/904ca016ac0a7adac2e44f808e14cecb6547e120), so hopefully these interfaces can be adopted elsewhere and are more acceptable for upstream use.

If so, what about applications not using tcmalloc?

Other allocation libraries would need to add a similar operator new taking a hot_cold_t parameter to be able to use this particular part of the memprof related changes.
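For illustration, the hinted interfaces an allocator would need to expose are roughly of the following shape. This is a hedged sketch; the authoritative declarations, including the hint type's definition and the meaning of its values, are the ones in the tcmalloc source referenced above.

#include <cstddef>
#include <cstdint>

// Hint type assumed to live in the global namespace with a uint8_t underlying
// type, per the tcmalloc change discussed above; the allocator owns the actual
// definition and the meaning of its values.
enum class __hot_cold_t : uint8_t;

// Hinted allocation overloads that the compiler can emit calls to when the
// hinted interfaces are available (as detected via TargetLibraryInfo).
void *operator new(std::size_t size, __hot_cold_t hot_cold);
void *operator new[](std::size_t size, __hot_cold_t hot_cold);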

Teresa


Hi Teresa,

Thanks for all the work you put into pioneering memprof. We’ve been following the project on the side, and we noticed that recently the MemProfContextDisambiguation pass and hot/cold hint passing are all in place. We are curious about the current state of the project:

  1. Is it ready for others to try/experiment (either using tcmalloc or another allocator that accepts hot/cold hints)?
  2. Have you got performance results on your end, and are there any numbers you could share for memprof?

Thanks!
Wenlei

Hi Wenlei,

Thanks for reaching out!

  1. Is it ready for others to try/experiment (either using tcmalloc or another allocator that accepts hot/cold hints)?

Yes! @snehasish is going to follow up to this response with some examples showing the steps. Note that tcmalloc currently only treats cold-hinted data differently: while there is some basic support in LLVM now for marking some allocations as hot, that isn’t currently treated any differently by tcmalloc. Also, the hot hint marking is currently quite simple and only applies to allocations that don’t need cloning; only cold allocations are currently cloned as needed to expose the context sensitive behavior. Please let us know if you run into issues or have questions. Note the cloning code is currently conservative in the face of indirect (virtual) calls - something I will be trying to improve next.

  2. Have you got performance results on your end, and are there any numbers you could share for memprof?

Right now we are focused on productization and working with some of our partner teams to evaluate the improvements in production to metrics like dTLB and zswap usage. We have some promising results from smaller scale load tests, but unfortunately nothing that we can share at this time. We hope to share more data though after broader testing.

Teresa

@WenleiHe tcmalloc is the only allocator which supports hot/cold hints driving page allocation policies, AFAIK. So you’ll have to link with a version of tcmalloc built from the source available on GitHub (google/tcmalloc). Take a look at the operator new extensions defined here.

Collect a raw memprof profile
To automate the generation of memprof testdata I added some scripts in D145644 ([memprof] Add scripts to automate testdata regeneration). This would be the easiest place to get started, both for the compiler options required today and for toy examples with hot/cold allocations. Note that some of the options such as -Wl,--no-rosegment and -Wl,-build-id are necessary for symbolization of the profile today. We plan to address these rough edges in the future. Other options such as -fno-omit-frame-pointer and -fno-optimize-sibling-calls are necessary for accurate stack unwinding so that annotations can be applied during profile use.
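As a hedged example of what a collection build might look like, assuming the options described above plus -fmemory-profile to enable the instrumentation (file names are placeholders; the scripts in D145644 remain the authoritative reference):

# Instrumented build; the exact option set may differ from the scripts in D145644.
clang++ -fmemory-profile -fno-omit-frame-pointer -fno-optimize-sibling-calls \
  -Wl,--no-rosegment -Wl,-build-id hot_cold.cpp -o hot_cold
# Running the instrumented binary produces a raw memprof profile; see the
# scripts for how the output location is configured.
./hot_cold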

Post-process a memprof profile
We need the profiled binary to create an indexed profile from the raw profile using llvm-profdata merge. An example invocation can be found in the test here.

Also note that an instrumented PGO profile can be collected at the same time (in the previous step) and merged with the memprof raw profile in this step, as shown in the test above.
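A hedged example invocation, with placeholder file names (the linked tests remain the authoritative reference for the exact flags):

# Create the indexed profile; the profiled binary is needed for symbolization.
llvm-profdata merge hot_cold.memprofraw --profiled-binary ./hot_cold -o hot_cold.memprofdata
# An instrumented PGO raw profile from the same run can be merged in as well:
llvm-profdata merge hot_cold.memprofraw pgo.profraw --profiled-binary ./hot_cold -o combined.profdata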

Use a memprof profile
To feed the profile back into the compiler, we expect the same set of options, with the addition of -fprofile-use with the path to the indexed profile generated from the step above. Strictly speaking, the -fno-optimize-sibling-calls flag is not required in this profile use step. Note that we do need the -fdebug-info-for-profiling and -gmlt options to ensure that the correct callsites are annotated.

Next add the following options to the clang invocation:

# We need Thin-LTO for cloning
-flto=thin
# Enable the whole program analysis needed for cloning and
# emit operator new calls with hints.
-Wl,-mllvm,-enable-memprof-context-disambiguation -Wl,-mllvm,-supports-hot-cold-new -Wl,-mllvm,-optimize-hot-cold-new

And finally don’t forget the flags for a statically or dynamically linked tcmalloc.

Let us know if you run into any issues, we’d be happy to take a look. The infrastructure right now is a bit specialized for our use and we may uncover issues while using OSS tooling.

Thanks for the update and instructions Teresa and Snehasish.

We noticed the new tcmalloc interface that accepts hot/cold hints, but didn’t see the allocator really taking advantage of these new hints. Are we missing something, or are you experimenting with an internal tcmalloc version? Eventually we may want to tweak jemalloc for our use case if we see good potential.

All of the tcmalloc components for hot/cold hints have been available externally for a while now. At a high level, the hints specialize the hugepage aware backend to place known cold allocations on 4K pages rather than 2M pages. Given the interest, we will publish documentation on the hot/cold interface for tcmalloc shortly. I’ll update this thread when it is in place.

Additional documentation and pointers to the code have been added to tcmalloc: see google/tcmalloc@c2bed55 ("Add documentation for hot_cold interface for new operator").

Hi @teresajohnson @snehasish
I tried to feed the profile back into the compiler as
clang++ -fmemory-profile-use=a.tmp test_malloc_load_store.cpp -S -emit-llvm -O1
but I can't see an attribute like "memprof"="hot", and I can't find the code where the "memprof" attribute is added. Can you tell me how to feed the profile back into the compiler and where the compiler adds the "memprof" attribute?

I assume from the name that you are using compiler-rt/test/memprof/TestCases/test_malloc_load_store.c. This works for testing the profiling infrastructure, but the compiler feedback currently restricts it to operator new, because of the current use of the information for modifying operator new calls to add hints taking advantage of placement new (malloc cannot be modified that way).

If there are other uses in the optimizer that would benefit from having this info on malloc calls it would be fine to remove this restriction (see https://github.com/llvm/llvm-project/blob/e7b2855787cab5ccaf195b9c86985f263eb29cfe/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp#L828).

If you are just testing this out, try a case with an operator new. If that doesn’t work, please send me the reproducer and I’ll take a look.

Let me know if you have other questions or issues.

Teresa


Hi, @teresajohnson @snehasish
It seems MemProf is not supported in distributed ThinLTO. In the thin backend, it seems !memprof and !callsite can't be consumed.

We use this exclusively in distributed ThinLTO mode. Can you provide the error you are seeing and any repro instructions?

Hi @teresajohnson
I ran a test by separating the source code in llvm-project/llvm/test/ThinLTO/X86/memprof-funcassigncloning.ll into a.cpp and main.cpp.
main.cpp

 #include <cstring>
 #include <unistd.h>
 extern void E(char **buf1, char **buf2);

 extern void B(char **buf1, char **buf2);

 extern void C(char **buf1, char **buf2);

 extern void D(char **buf1, char **buf2);


 int main(int argc, char **argv) {
   char *cold1, *cold2, *default1, *default2, *default3, *default4;
   B(&default1, &default2);
   C(&default3, &cold1);
   D(&cold2, &default4);
   memset(cold1, 0, 10);
   memset(cold2, 0, 10);
   memset(default1, 0, 10);
   memset(default2, 0, 10);
   memset(default3, 0, 10);
   memset(default4, 0, 10);
   delete[] default1;
   delete[] default2;
   delete[] default3;
   delete[] default4;
   sleep(10);
   delete[] cold1;
   delete[] cold2;
   return 0;
 }

a.cpp

 void E(char **buf1, char **buf2) {
   *buf1 = new char[10];
   *buf2 = new char[10];
 }

 void B(char **buf1, char **buf2) {
   E(buf1, buf2);
 }

 void C(char **buf1, char **buf2) {
   E(buf1, buf2);
 }

 void D(char **buf1, char **buf2) {
   E(buf1, buf2);
 }

Non-distributed ThinLTO mode
In non-distributed ThinLTO mode, I can see the cloning and the replacement of operator new():

/data00/lifengxiang.1025/llvm_newest/build/bin/clang++ -fmemory-profile-use=a.tmp   -O1 -gmlt -fdebug-info-for-profiling  -Wl,-mllvm,-enable-memprof-context-disambiguation -Wl,-mllvm,-supports-hot-cold-new -Wl,-mllvm,-optimize-hot-cold-new -fuse-ld=lld  -flto=thin -fno-inline a.cpp main.cpp  -mllvm --memprof-ave-lifetime-cold-threshold=5  
ld.lld: error: undefined symbol: operator new[](unsigned long, __hot_cold_t)
>>> referenced by a.cpp:2 (/data00/lifengxiang.1025/test/MemProfContextDisambiguation/a.cpp:2)
>>>               lto.tmp:(E(char**, char**) (.memprof.2))
>>> referenced by a.cpp:3 (/data00/lifengxiang.1025/test/MemProfContextDisambiguation/a.cpp:3)
>>>               lto.tmp:(E(char**, char**) (.memprof.3))
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

Distributed ThinLTO mode
In distributed ThinLTO mode:

  1. Thin compile
/data00/lifengxiang.1025/llvm_newest/build/bin/clang++ -fmemory-profile-use=a.tmp   -O1 -gmlt -fdebug-info-for-profiling  -mllvm -enable-memprof-context-disambiguation -mllvm -supports-hot-cold-new -mllvm -optimize-hot-cold-new -fuse-ld=lld  -flto=thin -fno-inline a.cpp main.cpp  -mllvm --memprof-ave-lifetime-cold-threshold=5  -c
  2. Thin link
/data00/lifengxiang.1025/llvm_newest/build/bin/clang++ -fuse-ld=lld -flto=thin -Wl,-plugin-opt,thinlto-index-only=thinlto-param-file -Wl,-plugin-opt,thinlto-emit-imports-files a.o main.o 

After the thin link, I can see the memprof information in the function summaries.
a.o.ll:

^2 = gv: (name: "_Z1EPPcS0_", summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, canAutoHide: 0), insts: 5, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 1, alwaysInline: 0, noUnwind: 0, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0), calls: ((callee: ^5)), allocs: ((versions: (none), memProf: ((type: notcold, stackIds: (2221125039561139392)), (type: notcold, stackIds: (11541504054835133535)), (type: cold, stackIds: (17775019957889038416)))), (versions: (none), memProf: ((type: notcold, stackIds: (2221125039561139392)), (type: cold, stackIds: (11541504054835133535)), (type: notcold, stackIds: (17775019957889038416)))))))) ; guid = 4815915868504849163

a.o.thinlto.ll:

^2 = gv: (guid: 4815915868504849163, summaries: (function: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 0, live: 1, dsoLocal: 1, canAutoHide: 0), insts: 5, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 1, alwaysInline: 0, noUnwind: 0, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0), allocs: ((versions: (none), memProf: ((type: notcold, stackIds: (2221125039561139392)), (type: notcold, stackIds: (11541504054835133535)), (type: cold, stackIds: (17775019957889038416)))), (versions: (none), memProf: ((type: notcold, stackIds: (2221125039561139392)), (type: cold, stackIds: (11541504054835133535)), (type: notcold, stackIds: (17775019957889038416))))))))
  3. Thin backend
/data00/lifengxiang.1025/llvm_newest/build/bin/clang++ -O1 -gmlt -fdebug-info-for-profiling  -mllvm -enable-memprof-context-disambiguation -mllvm -supports-hot-cold-new -mllvm -optimize-hot-cold-new -mllvm --memprof-ave-lifetime-cold-threshold=5 -x ir a.o -fthinlto-index=a.o.thinlto.bc -c -o a.o.o

After the thin backend, the output a.o.o doesn't have operator new replaced. By debugging, I found that LibCallSimplifier::optimizeNew always returns nullptr (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp#L1731C1-L1732C1).

nm a.o.o
0000000000000030 T _Z1BPPcS0_
0000000000000040 T _Z1CPPcS0_
0000000000000050 T _Z1DPPcS0_
0000000000000000 t _Z1EPPcS0_
                 U _Znam