RFC: EfficiencySanitizer

It would be great to automatically apply the results of the tools, but we
do not think that this is straightforward in enough cases to pursue up
front. For the cache fragmentation tool, automatically applying data
structure field reordering (or splitting or peeling) generally requires
whole-program compilation, which is not always available and currently does
not scale up to the size of applications we would like to target. The
working set tool is not a candidate for automated action. Acting on dead
stores typically requires programmer analysis to confirm that there is no
unexercised path on which the store is actually live.

We would like to start with a standalone sanitizer usage model providing
developer feedback. How about we start with that, and in the future we can
revisit whether some subset of the results can be acted on in an automated
fashion?

Is there a good place to discuss potential ideas for this? I don't just
want to add more to the end of this thread if you think it'll get unwieldy.
Perhaps bugzilla feature requests or something like that?

I created a container feature request for ideas:
https://llvm.org/bugs/show_bug.cgi?id=27438

And +1 to the request for more information on the kinds of things you are
planning to write. Would be great to know what you have in mind and then
discuss other ideas too.

I sent out a separate RFC on the working set tool and we'll send one for
the other tools as well.

Perfect. Thanks!

Thank you for the ideas. I've added them to
https://llvm.org/bugs/show_bug.cgi?id=27438

The case studies in the DeadSpy paper show how they, the users, fixed the
most frequent cases of dead stores in SPEC CPU (yes, in some cases working
around the compiler, but not always): "DeadSpy: A Tool to Pinpoint Program
Inefficiencies" by Chabbi et al., CGO 2012.
<https://www.researchgate.net/publication/241623127_DeadSpy_A_tool_to_pinpoint_program_inefficiencies>

Some of this data might be interesting for profile guidance. Are there any
plans there?

Esan instrumentation is geared toward application-level tuning by
developers -- the data collected here are not quite 'actionable' by the
compiler directly. For instance, struct field reordering needs whole
program analysis and can be very tricky to do for C++ code with complex
inheritance (e.g., the best base class field order may depend on the
context of the inheritance). Fancy struct layout changes such as peeling,
splitting/outlining, field inlining, etc. also require very good address
escape analysis. Dead store detection can be used indirectly by the
compiler -- the compiler certainly cannot use the information to prove
statically that the stores are dead, but a compiler developer can use this
tool to find the cases and figure out the missing optimizations in the
compiler. Working set size data is also not quite usable by the compiler.
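
To make the field reordering case concrete, here is a hypothetical sketch
(the struct and its fields are invented for illustration, not taken from any
real application) of the kind of layout change a cache-fragmentation report
would point a developer toward:

```cpp
#include <cstdint>

// Before: hot counters interleaved with cold data, so every cache line
// fetched for the counters also drags in bytes that are rarely touched.
struct ConnStatsBefore {
  uint64_t bytes_sent;      // hot: updated on every packet
  char     peer_name[48];   // cold: only read when logging
  uint64_t bytes_received;  // hot
  char     last_error[48];  // cold
};

// After: hot fields grouped together and cold fields peeled out behind a
// pointer, so many objects' hot parts pack densely into cache lines.
struct ConnStatsAfter {
  uint64_t bytes_sent;      // hot
  uint64_t bytes_received;  // hot
  struct Cold {
    char peer_name[48];
    char last_error[48];
  } *cold;                  // cold data allocated separately
};
```

Doing this by hand for one struct is easy; doing it automatically is exactly
where the whole-program and escape-analysis issues above come in.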

Esan's design tradeoffs are also not the same as PGO's. The former allows
more overhead and is less restricted -- it can go deeper.

David

Interesting idea! I understand how the bookkeeping in the tool is similar
to some of the sanitizers, but I am wondering whether that is really the
best developer workflow for such a tool.

I could imagine that some of the opportunities discovered by the tool
could be optimized automatically by the compiler (e.g. non-temporal loads,
software prefetching, partitioning the heap), so feeding this information
back to the compiler could be highly useful. I am wondering whether the PGO
model is closer to what we want in the end. The problem can also be thought
of as a natural extension of PGO. Besides instrumenting branches and
indirect calls, it adds instrumentation for loads and stores.

We have internally been discussing ways to use PGO for optimization
diagnostics (a continuation of Tyler's work; see the "Loop Vectorization:
Diagnostics and Control" post on the LLVM Project Blog).
The idea is to help the developer focus on opportunities in hot code.
It seems that the diagnostics provided by your tools could be emitted
directly by the analyses in LLVM during the profile-use phase.

You are right that some of the information is already available with PGO +
static analysis -- one example is field affinity data. Instruction working
set data is also 'roughly' available.

However, I think Esan can potentially do a better job, be easier to use,
and serve as a centralized place for getting such information.

David

The compiler can also use the dead store information directly. If you look
at the dead stores in hmmer in the DeadSpy paper (section 5.2), we now get
this with LLVM after http://reviews.llvm.org/D16712 because the loop is
multi-versioned to disambiguate may-aliasing accesses. Essentially, any time
we ask the user to add restrict we have a chance to do that automatically
with versioning, given enough confidence based on the profile information.
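
To illustrate (with a made-up kernel rather than the actual hmmer code), the
pattern looks roughly like this: the first store below is dead whenever the
pointers do not alias, but without restrict (or a runtime-versioned copy of
the loop guarded by an overlap check) the compiler has to keep it:

```cpp
void kernel(int *dst, int *tmp, const int *src, int n) {
  for (int i = 0; i < n; ++i) {
    tmp[i] = src[i] + 1;   // dead when tmp aliases neither src nor dst
    dst[i] = src[i] + 1;   // may read back the first store if tmp == src
    tmp[i] = dst[i] * 2;   // overwrites the first store
  }
}

// With restrict-qualified pointers the first store is provably dead and
// dead-store elimination can delete it; loop versioning achieves the same
// effect by emitting a run-time no-overlap check and a specialized loop.
void kernel_noalias(int *__restrict dst, int *__restrict tmp,
                    const int *__restrict src, int n) {
  for (int i = 0; i < n; ++i) {
    tmp[i] = src[i] + 1;   // now removable
    dst[i] = src[i] + 1;
    tmp[i] = dst[i] * 2;
  }
}
```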

Well, but that can just be different profiling levels that the user
specifies up front.

Adam

> The compiler can also use the dead store information directly. If you
> look at the dead stores in hmmer in the DeadSpy paper (section 5.2), we
> now get this with LLVM after http://reviews.llvm.org/D16712
> ([LoopVersioning] Annotate versioned loop with noalias metadata) because
> the loop is multi-versioned to disambiguate may-aliasing accesses.
> Essentially, any time we ask the user to add restrict we have a chance to
> do that automatically with versioning, given enough confidence based on
> the profile information.

By 'directly', it should mean that the compiler takes the feedback data
and directly eliminates the reported 'dead' store. Compiler changes or
user annotations are still needed in order for the compiler to prove the
store is dead statically. That is why I said it is used 'indirectly' by the
compiler (developers).

> > Data working set size data is also not quite usable by the compiler.
> >
> > Esan's design tradeoffs are also not the same as PGO. The former allows
> > more overhead and is less restricted -- it can go deeper.
>
> Well but that can just be different profiling levels that the user
> specifies upfront.

By PGO, what I meant is the default PGO mode. The key point is that Esan
does not have to follow the same design considerations.

David

Given that you're going to store metadata for 4-64 cache lines of user
memory in a single shadow memory cache line, do you anticipate
additional slowdown related to false sharing of shadow memory in
multithreaded applications? Any plans to deal with that?

The check-before-write optimization would avoid most shadow writes and so
alleviate the false sharing problem.
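
Roughly, the shadow update only writes when the value would actually change.
A minimal sketch of the idea (the shadow mapping and constants here are
hypothetical simplifications, not the actual esan runtime):

```cpp
#include <cstdint>

static constexpr uintptr_t kShadowOffset = 0x00001fff00000000ULL; // made up
static constexpr unsigned  kLineShift    = 6;  // 64-byte cache lines

static inline uint8_t *shadowFor(const void *addr) {
  uintptr_t line = reinterpret_cast<uintptr_t>(addr) >> kLineShift;
  return reinterpret_cast<uint8_t *>(line + kShadowOffset);
}

// Conceptually executed on every application load/store.
static inline void recordAccess(const void *addr) {
  uint8_t *sh = shadowFor(addr);
  if (*sh != 1)   // check first: after the first access to a line this is
    *sh = 1;      // a pure read, so threads touching neighboring lines do
}                 // not invalidate each other's copy of the shadow line
```

Since reads of a shared shadow cache line do not cause invalidations, only
the rare first-touch writes can false-share.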

> The "external DSO" case also exists, but unless the DSO is also
> instrumented, you'd get lots of false-negatives (which aren't "a big
> problem" with the sanitizers, but of course we want to minimize them
> (you'd do it by also instrumenting the DSO)).

We could also perform hybrid execution, i.e., execute the
compiler-instrumented code natively and execute external DSO code under a
dynamic binary instrumentation system like DynamoRIO
<http://www.dynamorio.org>. But that would be a totally different topic.

On 17 April 2016 at 22:46, Derek Bruening via llvm-dev wrote:
> Studying instruction cache behavior with compiler
> instrumentation can be challenging, however, so we plan to at least
> initially focus on data performance.

I'm interested in how you're going to do this without kernel profiling
probes, like perf.

Or is the point here to introduce syscalls in the right places instead
of sampling randomly? Wouldn't that bias your results?

I'm not sure I understand the question: are you asking whether not
gathering data on time spent in the kernel is an issue? Or are you asking
how to measure aspects of performance without using sampling or hardware
performance counters?

> Many of our planned tools target specific performance issues with data
> accesses. They employ the technique of *shadow memory* to store metadata
> about application data references, using the compiler to instrument loads
> and stores with code to update the shadow memory.

Is it just counting the number of reads/writes? Or are you going to
add how many of those accesses were hit by a cache miss?

It varies by tool. The brief descriptions in the original email hopefully
shed some light; we are also sending separate RFCs for each tool (working
set was already sent). The cache frag tool is basically just counting,
yes. There is no cache miss information here: we are not using hardware
perf counters nor running a software cache simulation. We are measuring
particular aspects of application behavior that tend to affect performance,
often abstracted away from the precise microarchitecture you're running on.

> *Cache fragmentation*: this tool gathers data structure field hotness
> information, looking for data layout optimization opportunities by grouping
> hot fields together to avoid data cache fragmentation. Future enhancements
> may add field affinity information if it can be computed with low enough
> overhead.

It would also be good to have temporal information, so that you can
correlate data accesses that occur, for example, inside the same loop /
basic block, or in sequence along the common CFG path. This could lead to
changes in allocation patterns (heap, BSS).

Agreed, we have thought about adding temporal information, though it would
cost more and we have not fleshed out the details.

> *Working set measurement*: this tool measures the data working set size
> of an application at each snapshot during execution. It can help to
> understand phased behavior as well as providing basic direction for
> further effort by the developer: e.g., knowing whether the working set is
> close to fitting in current L3 caches or is many times larger can help
> determine where to spend effort.

This is interesting, but most useful when your dataset changes size
over different runs. This is similar to running the program under perf
for different workloads, and I'm not sure how you're going to get that
in a single run. It also comes with the additional problem that cache
sizes are not always advertised, so you might need an additional tool
to guess the sizes by increasing the size of data blocks and finding
steps in the data access graph.

This tool is relatively agnostic of the precise details of the caches
beyond having its granularity based on the cache line size it assumes (64
bytes, can be parametrized).

From: "Derek Bruening via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Cc: efficiency-sanitizer@google.com
Sent: Sunday, April 17, 2016 4:46:29 PM
Subject: [llvm-dev] RFC: EfficiencySanitizer

TL;DR: We plan to build a suite of compiler-based dynamic
instrumentation tools for analyzing targeted performance problems.
These tools will all live under a new "EfficiencySanitizer" (or
"esan") sanitizer umbrella, as they will share significant portions
of their implementations.

====================
Motivation

Our goal is to build a suite of dynamic instrumentation tools for
analyzing particular performance problems that are difficult to
evaluate using other profiling methods. Modern hardware performance
counters provide insight into where time is spent and when
micro-architectural events such as cache misses are occurring, but
they are of limited effectiveness for contextual analysis: it is not
easy to answer *why* a cache miss occurred.

Examples of tools that we have planned include: identifying wasted or
redundant computation, identifying cache fragmentation, and
measuring working sets. See more details on these below.

====================
Approach

We believe that tools with overhead beyond about 5x are simply too
heavyweight to easily apply to large, industrial-sized applications
running real-world workloads. Our goal is for our tools to gather
useful information with overhead less than 5x, and ideally closer to
3x, to facilitate deployment. We would prefer to trade off accuracy
and build a less-accurate tool below our overhead ceiling rather than
build a high-accuracy but slow tool. We hope to hit a sweet spot of
tools that gather trace-based contextual information not feasible
with pure sampling yet are still practical to deploy.

In a similar vein, we would prefer a targeted tool that analyzes one
particular aspect of performance with low overhead over a more
general tool that can answer more questions but has high overhead.

Dynamic binary instrumentation is one option for these types of
tools, but typically compiler-based instrumentation provides better
performance, and we intend to focus only on analyzing applications
for which source code is available. Studying instruction cache
behavior with compiler instrumentation can be challenging, however,
so we plan to at least initially focus on data performance.

Many of our planned tools target specific performance issues with
data accesses. They employ the technique of *shadow memory* to store
metadata about application data references, using the compiler to
instrument loads and stores with code to update the shadow memory. A
companion runtime library intercepts libc calls if necessary to
update shadow memory on non-application data references. The runtime
library also intercepts heap allocations and other key events in
order to perform its analyses. This is all very similar to how
existing sanitizers such as AddressSanitizer, ThreadSanitizer,
MemorySanitizer, etc. operate today.
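
As a rough source-level picture of what that instrumentation does (the
__esan_record_access entry point below is a hypothetical name used only for
illustration; the real pass operates on LLVM IR and the actual runtime
interface may differ):

```cpp
// Hypothetical runtime hook: in the real tool this would update shadow
// memory for the accessed address; here it just counts accesses.
extern "C" void __esan_record_access(const void *addr, unsigned size) {
  static unsigned long num_accesses = 0;
  (void)addr;
  (void)size;
  ++num_accesses;
}

// What an instrumented function conceptually looks like: every load or
// store is preceded by a call that updates shadow state for that address.
int sum(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) {
    __esan_record_access(&a[i], sizeof a[i]);  // inserted by the compiler
    s += a[i];                                 // original load
  }
  return s;
}
```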

====================
Example Tools

We have several initial tools that we plan to build. These are not
necessarily novel ideas on their own: some of these have already
been explored in academia. The idea is to create practical,
low-overhead, robust, and publicly available versions of these
tools.

*Cache fragmentation*: this tool gathers data structure field hotness
information, looking for data layout optimization opportunities by
grouping hot fields together to avoid data cache fragmentation.
Future enhancements may add field affinity information if it can be
computed with low enough overhead.

*Working set measurement*: this tool measures the data working set
size of an application at each snapshot during execution. It can
help to understand phased behavior as well as providing basic
direction for further effort by the developer: e.g., knowing whether
the working set is close to fitting in current L3 caches or is many
times larger can help determine where to spend effort.
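
A toy sketch of the underlying bookkeeping (an illustration of the idea, not
the planned implementation): one "touched" flag per cache line, counted and
cleared at each snapshot:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class WorkingSetCounter {
  std::vector<uint8_t> touched_;  // one flag per 64-byte line of the region
public:
  explicit WorkingSetCounter(size_t region_bytes)
      : touched_(region_bytes / 64, 0) {}

  void recordAccess(size_t offset_in_region) {  // on every load/store
    touched_[offset_in_region / 64] = 1;
  }

  size_t takeSnapshot() {  // called periodically; returns bytes touched
    size_t lines = 0;
    for (uint8_t &t : touched_) {
      lines += t;
      t = 0;               // reset so the next snapshot starts fresh
    }
    return lines * 64;
  }
};
```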

Will this technology allow us to pinpoint specific accesses that generally have high latency (i.e. generally are cache misses)? This information is useful for programmers, and is also useful as an input to loop unrolling, instruction scheduling, and the like on out-of-order cores.

-Hal

Not when the dead store happens in an external DSO where the compiler
can't detect it (the same applies to single references).

Do you mean the ones between the DSO and the instrumented code?
Because if it's just in the DSO itself, then the compiler could have
spotted it, too, when compiling the DSO.

Sure, I meant inter-DSO.

I mean, of course there are cases like interprocedural dead stores
(call a function that changes A while changing A right after), but
that again could be found at compilation time, given enough inlining
depth or IP analysis.

When I read the description, I assumed it would (mostly) be used to
detect those inter-procedural dead-stores, that the compiler can't see
(without LTO, at least).
The "external DSO" case also exists, but unless the DSO is also
instrumented, you'd get lots of false-negatives (which aren't "a big
problem" with the sanitizers, but of course we want to minimize them
(you'd do it by also instrumenting the DSO)).

Good point; one (critical) requirement for easy integration is scalability - the user should not need to recompile a full distro to get useful info from the tool (not that that's always possible, of course). For example, this is (IMHO) the main obstacle to using MSan in production.

Also, if this is not something the compiler can fix, what is the point
of detecting dead-stores? For all the non-trivial cases the compiler
can't spot, most will probably arise from special situations where the
compiler is changing the code to expose the issue, and thus, the user
has little control over how to fix the underlying problem.

Same as the other sanitizers: the compiler can't fix it, but you (the
programmer) can! :)

Indeed. Performance engineers (who will be very interested in this tool) typically have the luxury of optimizing the code across packages.

I'm not sure I understand the question: are you asking whether not gathering
data on time spent in the kernel is an issue? Or you're asking how to
measure aspects of performance without using sampling or hardware
performance counters?

The latter. I mean, counting hits with mcount is easy, but other
information like cache misses needs kernel support, no?

There is no cache miss information here: we are not using hardware perf
counters nor running a software cache simulation. We are measuring
particular aspects of application behavior that tend to affect performance,
often abstracted away from the precise microarchitecture you're running on.

Ah, right, I think that explains my confusion.

cheers,
--renato

Will this technology allow us to pinpoint specific accesses that generally
have high latency (i.e. generally are cache misses)? This information is
useful for programmers, and is also useful as an input to loop unrolling,
instruction scheduling, and the like on ooo cores.

Won't hardware performance counters tell you which accesses are delinquent?
The tools here are trying to provide more information and hopefully some
insight into the performance problem.
For example, if we can tell that the working set of an application is
slightly bigger than the L3 cache, developers would be able to take the
right action to improve the performance.

We welcome any suggestions about how the information can be used, or
what information should be collected.

Qin

From: "Qin Zhao" <zhaoqin@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Derek Bruening" <bruening@google.com>,
efficiency-sanitizer@google.com, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, April 21, 2016 12:48:20 PM
Subject: Re: [llvm-dev] RFC: EfficiencySanitizer

Won't hardware performance counter tell you which accesses are
delinquent accesses?

Yes, for the particular system on which you're running. The question is, as the (effective) cache size changes (either because you're running more threads per core, socket, etc. or as you run across different architectures), can you say where the cache-miss hot-spots will be?

-Hal

There are many reasons that may cause cache misses. Since we are not doing
a full cache simulation, we won't be able to report the cache misses for
all cases.
However, it is possible in certain cases.
Because we can keep up to 8 snapshots of the working set (xref the working
set tool RFC), we can easily identify a memory reference that touches a
cache line that was not touched in the previous snapshot. Such references
have a high chance of causing cache misses.
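
As a toy illustration of that heuristic (the names and shadow layout are
invented; the real encoding may differ): with one touched bit per recent
snapshot in each shadow byte, an access whose line has the previous
snapshot's bit clear is flagged as a likely miss:

```cpp
#include <cstdint>

// Returns true if this access is a likely cache miss: the line was not
// touched during the previous snapshot interval.  Also marks the line as
// touched in the current interval.
bool recordAndClassify(uint8_t &shadow_byte, unsigned snapshot_idx /*0..7*/) {
  unsigned prev = (snapshot_idx + 7) % 8;
  bool likely_miss = (shadow_byte & (1u << prev)) == 0;
  shadow_byte |= static_cast<uint8_t>(1u << snapshot_idx);
  return likely_miss;
}
```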

Interesting, thanks! That sounds potentially useful.

-Hal

The problem with using HW performance counters for this kind of problem is
that the reported 'cache miss hotspots' are sometimes the symptom, not the
root cause -- the reported accesses are likely to be innocent victims of the
real problems elsewhere. For instance, a large write can pollute the cache
if it is not marked as non-temporal. Such issues need to be looked at
globally, with collective evidence.
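
For example (illustrative only, and assuming Clang's
__builtin_nontemporal_store builtin), the fix for such a write is to stream
it past the cache rather than through it:

```cpp
#include <cstddef>
#include <cstdint>

// Large streaming initialization: the non-temporal hint keeps the written
// lines from evicting the rest of the working set.  Whether this helps
// depends on the workload and the target.
void clearBuffer(uint32_t *buf, size_t n) {
  const uint32_t zero = 0;
  for (size_t i = 0; i < n; ++i)
    __builtin_nontemporal_store(zero, &buf[i]);
}
```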

David