RFC: EfficiencySanitizer Cache Fragmentation tool

Please reference the prior RFC on EfficiencySanitizer. This is one of the performance analysis tools we would like to build under the EfficiencySanitizer umbrella.

From: "Qin Zhao via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, April 22, 2016 4:59:00 PM
Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

Please reference the prior RFC on EfficiencySanitizer. This is one of
the performance analysis tools we would like to build under the
EfficiencySanitizer umbrella.

====================
Motivation

An application is running sub-optimally if only part of the data
brought into the cache is used, which we call cache fragmentation.
Knowing the cache fragmentation information during a given
application's execution helps developers to understand the
application’s cache behavior and how best to direct performance
optimization efforts. For example, developers may reorder the struct
fields based on the cache fragmentation information and hopefully
improve cache hit ration and performance.

====================
Approach

We focus on two ways to get cache fragmentation information:

1.
Struct field access patterns.
2.
Heap/global object access patterns.

=====================
Struct field access patterns

1.
Get all the struct type information (e.g., via
getIdentifiedStructTypes()), and create a counter for each field of
each struct.
2.
Instrument each GEP (GetElementPtr) instruction if it operates on a
struct type to update the corresponding field counter.
3.
At the program exit, filter and sort the struct field reference
counters, and print the struct field hotness information for those
structs deemed most likely to affect performance. The
sorting/filtering metric could include disparity between fields: hot
fields interleaved with cold fields, with a total access count high
enough to matter.

There are a few potential problems with this simple approach:

*
Overcount: a GEP instruction does not necessarily mean a field
access.
*
Undercount: a GEP instruction may lead to multiple field accesses,
especially if the address is passed to another function as an
argument.

I can't tell from your description, but it sounds like you might also undercount accesses from escaped addresses (because these are later indistinguishable from heap accesses)?

*

*
Racy update by multiple threads.

We want to keep the instrumentation simple in our initial
implementation for both robustness and performance reasons, so we
will defer any analysis (e.g., data flow analysis) to later stages.
Any suggestions on how to improve the accuracy are welcome.

I don't understand why you're using a separate mechanism here from what is being used for heap accesses. Why don't you use the same shadow-memory scheme here as you do for heap accesses (especially considering that escaped stack addresses will be counted this way anyway), and then upon function exit, collect the counts from the local stack? I think the necessary region is:

[@llvm.frameaddress(0), @llvm.frameaddress(0) + @llvm.get.dynamic.area.offset())

or you can call @llvm.stacksave() upon entry and use that as the base offset.

-Hal

From: "Hal Finkel via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Qin Zhao" <zhaoqin@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, April 22, 2016 7:13:38 PM
Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation
tool

> From: "Qin Zhao via llvm-dev" <llvm-dev@lists.llvm.org>

> To: "llvm-dev" <llvm-dev@lists.llvm.org>

> Sent: Friday, April 22, 2016 4:59:00 PM

> Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation
> tool

> Please reference the prior RFC on EfficiencySanitizer. This is one
> of
> the performance analysis tools we would like to build under the
> EfficiencySanitizer umbrella.

> ====================

> Motivation

> ====================

> An application is running sub-optimally if only part of the data
> brought into the cache is used, which we call cache fragmentation.
> Knowing the cache fragmentation information during a given
> application's execution helps developers to understand the
> application’s cache behavior and how best to direct performance
> optimization efforts. For example, developers may reorder the
> struct
> fields based on the cache fragmentation information and hopefully
> improve cache hit ration and performance.

> ====================

> Approach

> ====================

> We focus on two ways to get cache fragmentation information:

> 1.

> Struct field access patterns.

> 2.

> Heap/global object access patterns.

> =====================

> Struct field access patterns

> =====================

> 1.

> Get all the struct type information (e.g., via
> getIdentifiedStructTypes()), and create a counter for each field of
> each struct.

> 2.

> Instrument each GEP (GetElementPtr) instruction if it operates on a
> struct type to update the corresponding field counter.

> 3.

> At the program exit, filter and sort the struct field reference
> counters, and print the struct field hotness information for those
> structs deemed most likely to affect performance. The
> sorting/filtering metric could include disparity between fields:
> hot
> fields interleaved with cold fields, with a total access count high
> enough to matter.

> There are a few potential problems with this simple approach:

> *

> Overcount: a GEP instruction does not necessarily mean a field
> access.

> *

> Undercount: a GEP instruction may lead to multiple field accesses,
> especially if the address is passed to another function as an
> argument.

You can ignore my message; I misread the distinction here.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Qin Zhao"
<zhaoqin@google.com>
Sent: Friday, April 22, 2016 7:18:03 PM
Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation
tool

> From: "Hal Finkel via llvm-dev" <llvm-dev@lists.llvm.org>

> To: "Qin Zhao" <zhaoqin@google.com>

> Cc: "llvm-dev" <llvm-dev@lists.llvm.org>

> Sent: Friday, April 22, 2016 7:13:38 PM

> Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache
> Fragmentation
> tool

> > From: "Qin Zhao via llvm-dev" <llvm-dev@lists.llvm.org>
>

> > To: "llvm-dev" <llvm-dev@lists.llvm.org>
>

> > Sent: Friday, April 22, 2016 4:59:00 PM
>

> > Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation
> > tool
>

> > Please reference the prior RFC on EfficiencySanitizer. This is
> > one
> > of
> > the performance analysis tools we would like to build under the
> > EfficiencySanitizer umbrella.
>

> > ====================
>

> > Motivation
>

> > ====================
>

> > An application is running sub-optimally if only part of the data
> > brought into the cache is used, which we call cache
> > fragmentation.
> > Knowing the cache fragmentation information during a given
> > application's execution helps developers to understand the
> > application’s cache behavior and how best to direct performance
> > optimization efforts. For example, developers may reorder the
> > struct
> > fields based on the cache fragmentation information and hopefully
> > improve cache hit ration and performance.
>

> > ====================
>

> > Approach
>

> > ====================
>

> > We focus on two ways to get cache fragmentation information:
>

> > 1.
>

> > Struct field access patterns.
>

> > 2.
>

> > Heap/global object access patterns.
>

> > =====================
>

> > Struct field access patterns
>

> > =====================
>

> > 1.
>

> > Get all the struct type information (e.g., via
> > getIdentifiedStructTypes()), and create a counter for each field
> > of
> > each struct.
>

> > 2.
>

> > Instrument each GEP (GetElementPtr) instruction if it operates on
> > a
> > struct type to update the corresponding field counter.
>

> > 3.
>

> > At the program exit, filter and sort the struct field reference
> > counters, and print the struct field hotness information for
> > those
> > structs deemed most likely to affect performance. The
> > sorting/filtering metric could include disparity between fields:
> > hot
> > fields interleaved with cold fields, with a total access count
> > high
> > enough to matter.
>

> > There are a few potential problems with this simple approach:
>

> > *
>

> > Overcount: a GEP instruction does not necessarily mean a field
> > access.
>

> > *
>

> > Undercount: a GEP instruction may lead to multiple field
> > accesses,
> > especially if the address is passed to another function as an
> > argument.
>

I changed my mind; don't ignore what I said. But let me propose a slightly-different design:

1. Just as you've described, collect counts in shadow memory for memory accesses.
2. For each relevant type, emit a function that knows how to collect counts from an array of that type into the struct-field counters .
3. When memory is freed, call the appropriate function to perform this accumulation.
4. If desired, also collect counts from stack memory (as I indicated below, although we'll need to call all of the relevant summation functions, and this could get expensive if they're not inlined).

In this way, we'll have no over-counting / under-counting problems; plus this is easily extensible to collecting per-allocation struct-field access stats (which is often desired). We can get everything at the same time, and might actually be less expensive when running on multiple cores (because they'll be less cache-line contention for the struct-field counters).

Thanks again,
Hal

Thanks for the comment and suggestions. That’s a great idea!

We actually thought about using each heap object with its type information for more accurate data, and it is definitely in our future plan.
However, there are challenges to do so.
For example, getting correct type information for each heap object might not be easy, especially for C programs. An application can even use a custom allocator and make things even harder.
Please let me know if there is a way to get correct type information for heap object.
There are also other issues like how to count union of two struct types.

I want to collect good enough information with minimized overhead, so I need evaluate overhead and quality of the profile for each approach.

Maybe the simple field counting approach is good enough. We need implement and evaluate.

– Qin

From: "Qin Zhao" <zhaoqin@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Derek Bruening" <bruening@google.com>, efficiency-sanitizer@google.com
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, April 22, 2016 10:26:32 PM
Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

Thanks for the comment and suggestions. That's a great idea!

We actually thought about using each heap object with its type
information for more accurate data, and it is definitely in our
future plan.
However, there are challenges to do so.
For example, getting correct type information for each heap object
might not be easy, especially for C programs. An application can
even use a custom allocator and make things even harder.
Please let me know if there is a way to get correct type information
for heap object.
There are also other issues like how to count union of two struct
types.

I want to collect good enough information with minimized overhead, so
I need evaluate overhead and quality of the profile for each
approach.
Maybe the simple field counting approach is good enough. We need
implement and evaluate.

Good point: At the deallocation site you generally can't see the loads, so you don't have the type information there.

I think the best option here is to put this part of the lowering in the frontend. The frontend can see the pointer type where the delete operator is called, and can generate the necessary 'accumulation' function based on that. This will give a higher-quality result than using IR-level structure fields for two reasons: 1) The frontend knows how to ignore padding 2) The frontend knows the proper names of the fields. The frontend can also deal in the a reasonable way with unions (although I doubt you'll ever get a great solution here).

Thanks again,
Hal

From: "Hal Finkel via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Qin Zhao" <zhaoqin@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, efficiency-sanitizer@google.com
Sent: Friday, April 22, 2016 10:43:44 PM
Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

> From: "Qin Zhao" <zhaoqin@google.com>
> To: "Hal Finkel" <hfinkel@anl.gov>, "Derek Bruening"
> <bruening@google.com>, efficiency-sanitizer@google.com
> Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
> Sent: Friday, April 22, 2016 10:26:32 PM
> Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache
> Fragmentation tool
>
>
> Thanks for the comment and suggestions. That's a great idea!
>
>
> We actually thought about using each heap object with its type
> information for more accurate data, and it is definitely in our
> future plan.
> However, there are challenges to do so.
> For example, getting correct type information for each heap object
> might not be easy, especially for C programs. An application can
> even use a custom allocator and make things even harder.
> Please let me know if there is a way to get correct type
> information
> for heap object.
> There are also other issues like how to count union of two struct
> types.
>
>
> I want to collect good enough information with minimized overhead,
> so
> I need evaluate overhead and quality of the profile for each
> approach.
> Maybe the simple field counting approach is good enough. We need
> implement and evaluate.

Good point: At the deallocation site you generally can't see the
loads, so you don't have the type information there.

I think the best option here is to put this part of the lowering in
the frontend. The frontend can see the pointer type where the delete
operator is called, and can generate the necessary 'accumulation'
function based on that. This will give a higher-quality result than
using IR-level structure fields for two reasons: 1) The frontend
knows how to ignore padding 2) The frontend knows the proper names
of the fields. The frontend can also deal in the a reasonable way
with unions (although I doubt you'll ever get a great solution
here).

To reiterate a point you made, however, this does not handle cases where a custom allocator creates space for some 'tail structure'. It is worse than that, however, because it does not do the useful thing for virtual base pointers in general. This probably does not matter so much for many of my use cases (where we have arrays of structures or basic types), but in some environments could be a significant issue. I believe that the virtual-base-pointer case is solvable by using RTTI (or some similar mechanism). I'm not yet sure how tail structures could be handled under this scheme.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, efficiency-sanitizer@google.com, "Qin Zhao" <zhaoqin@google.com>
Sent: Friday, April 22, 2016 11:53:24 PM
Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

> From: "Hal Finkel via llvm-dev" <llvm-dev@lists.llvm.org>
> To: "Qin Zhao" <zhaoqin@google.com>
> Cc: "llvm-dev" <llvm-dev@lists.llvm.org>,
> efficiency-sanitizer@google.com
> Sent: Friday, April 22, 2016 10:43:44 PM
> Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache
> Fragmentation tool
>
> > From: "Qin Zhao" <zhaoqin@google.com>
> > To: "Hal Finkel" <hfinkel@anl.gov>, "Derek Bruening"
> > <bruening@google.com>, efficiency-sanitizer@google.com
> > Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
> > Sent: Friday, April 22, 2016 10:26:32 PM
> > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache
> > Fragmentation tool
> >
> >
> > Thanks for the comment and suggestions. That's a great idea!
> >
> >
> > We actually thought about using each heap object with its type
> > information for more accurate data, and it is definitely in our
> > future plan.
> > However, there are challenges to do so.
> > For example, getting correct type information for each heap
> > object
> > might not be easy, especially for C programs. An application can
> > even use a custom allocator and make things even harder.
> > Please let me know if there is a way to get correct type
> > information
> > for heap object.
> > There are also other issues like how to count union of two struct
> > types.
> >
> >
> > I want to collect good enough information with minimized
> > overhead,
> > so
> > I need evaluate overhead and quality of the profile for each
> > approach.
> > Maybe the simple field counting approach is good enough. We need
> > implement and evaluate.
>
> Good point: At the deallocation site you generally can't see the
> loads, so you don't have the type information there.
>
> I think the best option here is to put this part of the lowering in
> the frontend. The frontend can see the pointer type where the
> delete
> operator is called, and can generate the necessary 'accumulation'
> function based on that. This will give a higher-quality result than
> using IR-level structure fields for two reasons: 1) The frontend
> knows how to ignore padding 2) The frontend knows the proper names
> of the fields. The frontend can also deal in the a reasonable way
> with unions (although I doubt you'll ever get a great solution
> here).

To reiterate a point you made, however, this does not handle cases
where a custom allocator creates space for some 'tail structure'. It
is worse than that, however, because it does not do the useful thing
for virtual base pointers in general. This probably does not matter
so much for many of my use cases (where we have arrays of structures
or basic types), but in some environments could be a significant
issue. I believe that the virtual-base-pointer case is solvable by
using RTTI (or some similar mechanism).

There's a possibly-elegant way to handle this problem: For classes with a virtual destructor, the code to sum the shadow-memory counts into field counts goes in the destructor. Otherwise, it gets injected in the call to the delete operator (or any explicit destructor call, which should catch a lot of the 'tail structure' cases too).

-Hal