TL;DR: We plan to build a suite of compiler-based dynamic instrumentation tools for analyzing targeted performance problems. These tools will all live under a new “EfficiencySanitizer” (or “esan”) sanitizer umbrella, as they will share significant portions of their implementations.
Have you considered naming it performance sanitizer
instead? I believe it would be easy to mishear esan as asan, and psan would solve that.
Besides, looks like fun! Good luck
Piotr
Have you considered naming it `performance sanitizer` instead? I believe
it would be easy to mishear esan as asan, and psan would solve
that.
I call that E/ē/san, which I feel is far enough from A/eɪ/san.
Instead it may be misheard as T/tē/san, but P/pē/san won't help much there
either.
This sounds interesting. I’ve got a couple of questions about the cache fragmentation tool and the working set measurement tool.
I can vaguely imagine how this data would be acquired, but I'm more interested in what analysis is provided by the tool, and how this information would be presented to a user. Would it be a flat list of classes, sorted by number of accesses, with each field annotated by number of accesses? Or is there some other kind of presentation planned? Maybe some kind of weighting for classes with frequent cache misses?

I think my questions here are basically the reverse of my prior questions. I can imagine the presentation (a graph with time on the X axis, working set measurement on the Y axis, with some markers highlighting key execution points). I'm not sure how the data collection works though, or even really what is being measured. Are you planning on counting the number of data bytes / data cache lines used during each time period? For the purposes of this tool, when is data brought into the working set and when is data evicted from the working set?
*Cache fragmentation*: this tool gathers data structure field hotness
information, looking for data layout optimization opportunities by grouping
hot fields together to avoid data cache fragmentation. Future enhancements
may add field affinity information if it can be computed with low enough
overhead.

I can vaguely imagine how this data would be acquired, but I'm
more interested in what analysis is provided by the tool, and how this
information would be presented to a user. Would it be a flat list of
classes, sorted by number of accesses, with each field annotated by number
of accesses? Or is there some other kind of presentation planned? Maybe
some kind of weighting for classes with frequent cache misses?
The sorting/filtering metric will include disparity between fields: hot
fields interleaved with cold fields are what it's looking for, with a total
access count high enough to matter. Yes, it would present to the user the
field layout with annotations for access count as you suggest.
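To make that concrete, here is a hypothetical example of the kind of layout
such a report would flag; the struct and the access counts are invented for
illustration, not real tool output:

    #include <cstdint>

    // Invented example: hot and cold fields interleaved in one struct, so
    // every access to a hot field drags cold bytes into the data cache.
    struct Order {
      uint64_t id;            // e.g. 12,000,000 accesses (hot)
      char     notes[48];     // e.g.      3,000 accesses (cold)
      uint32_t price;         // e.g. 11,800,000 accesses (hot)
      char     audit[64];     // e.g.      1,200 accesses (cold)
      uint32_t quantity;      // e.g. 11,500,000 accesses (hot)
    };

    // The report would present this layout annotated with per-field access
    // counts and rank it by hot/cold disparity; regrouping id, price, and
    // quantity adjacently would pack the hot data into fewer cache lines.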
*Working set measurement*: this tool measures the data working set size of
an application at each snapshot during execution. It can help to
understand phased behavior as well as providing basic direction for further
effort by the developer: e.g., knowing whether the working set is close to
fitting in current L3 caches or is many times larger can help determine
where to spend effort.

I think my questions here are basically the reverse of my prior
questions. I can imagine the presentation (a graph with time on the X
axis, working set measurement on the Y axis, with some markers highlighting
key execution points). I'm not sure how the data collection works though,
or even really what is being measured. Are you planning on counting the
number of data bytes / data cache lines used during each time period? For
the purposes of this tool, when is data brought into the working set and
when is data evicted from the working set?
The tool records which data cache lines were touched at least once during a
snapshot (basically just setting a shadow memory bit for each load/store).
The metadata is cleared after each snapshot is recorded so that the next
snapshot starts with a blank slate. Snapshots can be combined via logical
OR as the execution time grows, to adaptively handle varying total execution
time.
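A minimal sketch of that scheme, in case it helps (the names, the toy
fixed-size region, and the bitset stand-in for real shadow memory are all
mine, not the actual implementation):

    #include <bitset>
    #include <cstddef>
    #include <cstdint>

    // One shadow bit per 64-byte cache line of a toy fixed-size region
    // (real shadow memory would cover the address space via a direct
    // mapping instead of a bitset).
    constexpr size_t kRegionBytes = 1 << 20;
    constexpr size_t kNumLines    = kRegionBytes / 64;

    std::bitset<kNumLines> current;     // lines touched in this snapshot
    std::bitset<kNumLines> accumulated; // OR of earlier snapshots

    // What the inlined instrumentation conceptually does on every
    // load/store in the region: set the bit for the touched cache line.
    inline void onAccess(uintptr_t addr, uintptr_t regionBase) {
      current.set((addr - regionBase) / 64);
    }

    // Taken at each snapshot point: record the working set size, fold the
    // snapshot into the accumulated set, and clear for a blank slate.
    size_t takeSnapshot() {
      size_t linesTouched = current.count(); // working set, in cache lines
      accumulated |= current;                // combine snapshots via logical OR
      current.reset();
      return linesTouched;
    }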
Have you considered naming it `performance sanitizer` instead? I
believe it would be easy to mishear esan as asan, and psan would
solve that.

I call that E/ē/san, which I feel is far enough from A/eɪ/san.
Instead it may be misheard as T/tē/san, but P/pē/san won't help much
there either.
Spit-balling: how about qsan, as in "quickness sanitizer", pronounced: queue-san.
So I guess a visualization of that information could show how much new memory was referenced per snapshot, how much was the same from a prior snapshot, and how much was “dropped”. Neat.
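If the previous snapshot's bit set is kept around, that breakdown is just
set arithmetic; a rough sketch along those lines (the bitset stand-in for
shadow memory is again only for illustration):

    #include <bitset>
    #include <cstddef>

    constexpr size_t kNumLines = 16384; // one bit per cache line, as before

    struct SnapshotDelta {
      size_t fresh;    // lines touched now but not in the prior snapshot
      size_t retained; // lines touched in both
      size_t dropped;  // lines touched before but not now
    };

    // Hypothetical breakdown of how consecutive snapshots relate.
    SnapshotDelta compare(const std::bitset<kNumLines> &prev,
                          const std::bitset<kNumLines> &cur) {
      return {(cur & ~prev).count(), (cur & prev).count(),
              (prev & ~cur).count()};
    }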
Some other things that might be useful to look for (unrelated to the Working set measurement tool):

- Missed restrict opportunities: with shadow memory, you should be able to track whether there is aliasing in practice for a given execution, and whether annotating with restrict would reduce the number of loads and stores (see the sketch after this list).

- Missed vectorization opportunities: I'm not exactly sure how you could annotate on this front, but this blog post (http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html) describes where some diagnostics are already present. If you can also determine whether those optimizations would be valuable through instrumentation, then it could be valuable to report it.
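For the restrict idea, the shape of code such a tool might flag could look
something like this (illustrative only; `__restrict__` is the Clang/GCC
spelling in C++):

    // Without aliasing information the compiler must reload *scale on
    // every iteration, because dst might overlap it.
    void scaleInto(float *dst, const float *src, const float *scale, int n) {
      for (int i = 0; i < n; ++i)
        dst[i] = src[i] * *scale;
    }

    // If shadow-memory tracking showed dst never aliases src or scale in
    // practice, annotating the pointers lets the compiler hoist the load
    // of *scale and vectorize more readily.
    void scaleIntoRestrict(float *__restrict__ dst,
                           const float *__restrict__ src,
                           const float *__restrict__ scale, int n) {
      for (int i = 0; i < n; ++i)
        dst[i] = src[i] * *scale;
    }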
Hi Derek,
Thanks for proposing this. It seems like it might be an interesting
tool for us too. But this proposal seems a bit hand-wavy, and I think
it's missing some crucial info before we start heading this way.
TL;DR: We plan to build a suite of compiler-based dynamic instrumentation
tools for analyzing targeted performance problems. These tools will all
live under a new "EfficiencySanitizer" (or "esan") sanitizer umbrella, as
they will share significant portions of their implementations.
I will bike-shed the name as much as anyone else, but I'll stay away for now
====================
Motivation

Our goal is to build a suite of dynamic instrumentation tools for analyzing
particular performance problems that are difficult to evaluate using other
profiling methods. Modern hardware performance counters provide insight
into where time is spent and when micro-architectural events such as cache
misses are occurring, but they are of limited effectiveness for contextual
analysis: it is not easy to answer *why* a cache miss occurred.

Examples of tools that we have planned include: identifying wasted or
redundant computation, identifying cache fragmentation, and measuring
working sets. See more details on these below.

====================
Approach

We believe that tools with overhead beyond about 5x are simply too
heavyweight to easily apply to large, industrial-sized applications running
real-world workloads. Our goal is for our tools to gather useful
information with overhead less than 5x, and ideally closer to 3x, to
facilitate deployment. We would prefer to trade off accuracy and build a
less-accurate tool below our overhead ceiling rather than build a high-accuracy
but slow tool. We hope to hit a sweet spot of tools that gather trace-based
contextual information not feasible with pure sampling yet are still
practical to deploy.
This is also very important for us, and there are sanitizers which are
"harder to sell" because of the overhead.
In a similar vein, we would prefer a targeted tool that analyzes one
particular aspect of performance with low overhead over a more general tool
that can answer more questions but has high overhead.

Dynamic binary instrumentation is one option for these types of tools, but
typically compiler-based instrumentation provides better performance, and we
intend to focus only on analyzing applications for which source code is
available. Studying instruction cache behavior with compiler
instrumentation can be challenging, however, so we plan to at least
initially focus on data performance.

Many of our planned tools target specific performance issues with data
accesses. They employ the technique of *shadow memory* to store metadata
about application data references, using the compiler to instrument loads
and stores with code to update the shadow memory. A companion runtime
library intercepts libc calls if necessary to update shadow memory on
non-application data references. The runtime library also intercepts heap
allocations and other key events in order to perform its analyses. This is
all very similar to how existing sanitizers such as AddressSanitizer,
ThreadSanitizer, MemorySanitizer, etc. operate today.

====================
Example Tools

We have several initial tools that we plan to build. These are not
necessarily novel ideas on their own: some of these have already been
explored in academia. The idea is to create practical, low-overhead,
robust, and publicly available versions of these tools.

*Cache fragmentation*: this tool gathers data structure field hotness
information, looking for data layout optimization opportunities by grouping
hot fields together to avoid data cache fragmentation. Future enhancements
may add field affinity information if it can be computed with low enough
overhead.

*Working set measurement*: this tool measures the data working set size of
an application at each snapshot during execution. It can help to understand
phased behavior as well as providing basic direction for further effort by
the developer: e.g., knowing whether the working set is close to fitting in
current L3 caches or is many times larger can help determine where to spend
effort.

*Dead store detection*: this tool identifies dead stores (write-after-write
patterns with no intervening read) as well as redundant stores (writes of
the same value already in memory). Xref the Deadspy paper from CGO 2012.

*Single-reference*: this tool identifies data cache lines brought in but
only read once. These could be candidates for non-temporal loads.
Do you have any estimates on memory overhead (both memory usage
(+shadow), and code size) you expect? As well as estimates about the
possible slowdown?
At least for the tools you are currently starting to implement, it
would be nice to have some estimates and plans on what is going to
happen.
I would actually like to see a small RFC about each tool and what the
plan (overhead/slowdown, pseudo-code for instrumentation, UX, ...) is
before starting to commit.
I don't expect the plan to be very detailed, nor for everything to be
pinned down, of course. This seems to be a bit at a research stage,
and I'm totally OK with it. But I would rather it not be as opaque as it
is now. The way the tools will report problems/data/etc. is important,
for example.
====================
EfficiencySanitizer

We are proposing the name EfficiencySanitizer, or "esan" for short, to refer
to this suite of dynamic instrumentation tools for improving program
efficiency. As we have a number of different tools that share quite a bit
of their implementation, we plan to consider them sub-tools under the
EfficiencySanitizer umbrella, rather than adding a whole bunch of separate
instrumentation and runtime library components.
How much code is expected to be shared? Most? Similar to what the
sanitizers already share?
Do we expect the shadow memory mapping to be (mostly) the same among
all esan's tools?
Do you already have an idea of a few tools + the type of code that
would be shared (for example: "read/write instrumentation is mostly
the same among these tools", or "generating reports for these tools is
mostly the same", or something similar)?
While these tools are not addressing correctness issues like other
sanitizers, they will be sharing a lot of the existing sanitizer runtime
library support code. Furthermore, users are already familiar with the
sanitizer brand, and it seems better to extend that concept rather than add
some new term.
Not the email to bike-shed the name, but I don't like "Efficiency"
that much here
Thank you for working on this,
Filipe
Thanks for proposing this. It seems like it might be an interesting
tool for us too. But this proposal seems a bit hand-wavy, and I think
it's missing some crucial info before we start heading this way.

At least for the tools you are currently starting to implement, it
would be nice to have some estimates and plans on what is going to
happen.
I would actually like to see a small RFC about each tool and what the
plan (overhead/slowdown, pseudo-code for instrumentation, UX, ...) is
before starting to commit.
I don't expect the plan to be very detailed, nor for everything to be
pinned down, of course. This seems to be a bit at a research stage,
and I'm totally OK with it. But I would rather it not be as opaque as it
is now. The way the tools will report problems/data/etc. is important,
for example.
The main response here is that these tools are in an exploratory stage and
that what we are proposing is not a fixed list of tools but a framework
within which we can build prototypes and refine them. We do not know which
ideas will work out and which ones will be abandoned. In some cases,
answers will not all be known until a prototype is in place to see how it
behaves on large applications: we might need to add additional shadow
metadata or additional instrumentation to produce the right amount of
actionable output. What we want to avoid is having to build prototypes
out-of-tree until they are proven and polished and only then trying to
commit large patches upstream.
Do you have any estimates on memory overhead (both memory usage
(+shadow), and code size) you expect? As well as estimates about the
possible slowdown?
The shadow memory scaledowns for the tools we are considering include 64:1,
4:1, and 1:1. This is all within the ranges of existing sanitizers. Each
tool will have inlined instrumentation to update shadow memory on either
every memory access or, in some cases, just heap accesses; I do not have a
number for code size expansion. The slowdown ranges were in the original
email: 2x-5x.
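For illustration only (the displacement and exact mapping below are
placeholders, not the real esan constants), a direct mapping with those
scaledowns is just a shift plus an offset:

    #include <cstdint>

    // Placeholder displacement; real mapping constants would be chosen per
    // platform, as the existing sanitizers do.
    constexpr uintptr_t kShadowOffset = 0x0000200000000000ULL;

    // Scale 6 gives a 64:1 mapping (one shadow byte per 64 application
    // bytes), scale 2 gives 4:1, and scale 0 gives 1:1.
    template <unsigned ShadowScale>
    inline uint8_t *shadowFor(uintptr_t appAddr) {
      return reinterpret_cast<uint8_t *>((appAddr >> ShadowScale) +
                                         kShadowOffset);
    }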
> We are proposing the name EfficiencySanitizer, or "esan" for short, to refer
> to this suite of dynamic instrumentation tools for improving program
> efficiency. As we have a number of different tools that share quite a bit
> of their implementation we plan to consider them sub-tools under the
> EfficiencySanitizer umbrella, rather than adding a whole bunch of separate
> instrumentation and runtime library components.

How much code is expected to be shared? Most? Similar to what the
sanitizers already share?
More than what the existing sanitizers share.
Do we expect the shadow memory mapping to be (mostly) the same among
all esan's tools?
Most will use a single mapping and single algorithm, but the scaledown will
vary as mentioned.
Do you already have an idea of a few tools + the type of code that
would be shared (for example: "read/write instrumentation is mostly
the same among these tools", or "generating reports for these tools is
mostly the same", or something similar)?
Instrumentation will be very similar (minor differences in shadow scaling
and shadow metadata format), identical intrinsic handling, identical
slowpath entries, identical libc interception, identical shadow
management. Reports are not fully explored.
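A rough picture of that split, with invented names (a sketch of the idea,
not the planned interface): the per-access path is shared, and only the
shadow scale and the metadata update differ per tool.

    #include <cstdint>

    // Invented names, just to illustrate shared vs. tool-specific pieces.
    struct ToolDesc {
      unsigned shadowScale;                                 // e.g. 6, 2, or 0
      void (*updateShadow)(uint8_t *shadow, bool isWrite);  // per-tool metadata
    };

    constexpr uintptr_t kShadowOffset = 0x0000200000000000ULL;  // placeholder

    // Shared fast path conceptually emitted for every instrumented
    // load/store; only the scale and the update callback vary per sub-tool.
    inline void onMemAccess(const ToolDesc &tool, uintptr_t addr, bool isWrite) {
      uint8_t *shadow = reinterpret_cast<uint8_t *>(
          (addr >> tool.shadowScale) + kShadowOffset);
      tool.updateShadow(shadow, isWrite);
    }

    // Example tool-specific update: the working set tool only needs to mark
    // the corresponding cache line as touched.
    void workingSetUpdate(uint8_t *shadow, bool /*isWrite*/) { *shadow = 1; }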
> While these tools are not addressing correctness issues like other
> sanitizers, they will be sharing a lot of the existing sanitizer runtime
> library support code. Furthermore, users are already familiar with the
> sanitizer brand, and it seems better to extend that concept rather than add
> some new term.

Not the email to bike-shed the name, but I don't like "Efficiency"
that much here
There were name discussions prior to the RFC and the conclusion was that
there is just not going to be a single first-place choice for everyone.
Heh, so I really do like “efficiency”, much more than performance or quickness.
But I really do think that we shouldn't try to bikeshed the name much. IMO, it's great to throw out ideas around better names, and certainly to point out any serious problems with a name, but past that I'll be happy with whatever the folks actually working on this pick.
I think we should all focus on the relevant technical discussion, and float any name ideas that we have, and let Derek and Qin make the final call on the name.
-Chandler
This is a cool project! Looking forward to seeing what kinds of data we can get.
Interesting idea! I understand how the bookkeeping in the tool is similar to some of the sanitizers but I am wondering whether that is really the best developer’s work-flow for such a tool.
I could imagine that some of the opportunities discovered by the tool could be optimized automatically by the compiler (e.g. non-temporal loads, software prefetching, partitioning the heap), so feeding this information back to the compiler could be highly useful. I am wondering whether the PGO model is closer to what we want at the end. The problem can also be thought of as a natural extension of PGO. Besides instrumenting branches and indirect calls, it adds instrumentation for loads and stores.
We have internally been discussing ways to use PGO for optimization diagnostics (a continuation of Tyler’s work, see http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html). The idea is to help the developer to focus in on opportunities in hot code. It seems that the diagnostics provided by your tools could be emitted directly by the analyses in LLVM during the profile-use phase.
Adam
Some of this data might be interesting for profile guidance. Are there any plans there?
– Sean Silva
Hi Derek,
I'm not an expert in any of these topics, but I'm excited that you
guys are doing it. It seems like a missing piece that needs to be
filled.
Some comments inline...
We would prefer to trade off accuracy and build a
less-accurate tool below our overhead ceiling rather than build a high-accuracy
but slow tool.
I agree with this strategy.
As a first approach, make it as fast as you can, then later
introduce more probes, maybe via some slider flag (like -ON) to
consciously trade speed for accuracy.
Studying instruction cache behavior with compiler
instrumentation can be challenging, however, so we plan to at least
initially focus on data performance.
I'm interested in how you're going to do this without kernel profiling
probes, like perf.
Or is the point here introducing syscalls in the right places instead
of profiling randomly? Wouldn't that bias your results?
Many of our planned tools target specific performance issues with data
accesses. They employ the technique of *shadow memory* to store metadata
about application data references, using the compiler to instrument loads
and stores with code to update the shadow memory.
Is it just counting the number of reads/writes? Or are you going to
add how many of those accesses were hit by a cache miss?
*Cache fragmentation*: this tool gathers data structure field hotness
information, looking for data layout optimization opportunities by grouping
hot fields together to avoid data cache fragmentation. Future enhancements
may add field affinity information if it can be computed with low enough
overhead.
It would also be good to have temporal information, so that you can
correlate data accesses that occur, for example, inside the same loop /
basic block, or in sequence in the common CFG flow. This could lead to
changes in allocation patterns (heap, BSS).
*Working set measurement*: this tool measures the data working set size of
an application at each snapshot during execution. It can help to understand
phased behavior as well as providing basic direction for further effort by
the developer: e.g., knowing whether the working set is close to fitting in
current L3 caches or is many times larger can help determine where to spend
effort.
This is interesting, but most useful when your dataset changes size
over different runs. This is similar to running the program under perf
for different workloads, and I'm not sure how you're going to get that
in a single run. It also comes with the additional problem that cache
sizes are not always advertised, so you might need an additional tool
to guess the sizes based on increasing the size of data blocks and
finding steps on the data access graph.
*Dead store detection*: this tool identifies dead stores (write-after-write
patterns with no intervening read) as well as redundant stores (writes of
the same value already in memory). Xref the Deadspy paper from CGO 2012.
This should probably be spotted by the compiler, so I guess it's a
tool for compiler developers to spot missed optimisation opportunities
in the back-end.
*Single-reference*: this tool identifies data cache lines brought in but
only read once. These could be candidates for non-temporal loads.
That's nice and should be simple enough to get a report in the end.
This also seems to be a hint to compiler developers rather than users.
I think you guys have a nice set of tools to develop and I'm looking
forward to working with them.
cheers,
--renato
*Dead store detection*: this tool identifies dead stores (write-after-write
patterns with no intervening read) as well as redundant stores (writes of
the same value already in memory). Xref the Deadspy paper from CGO 2012.

This should probably be spotted by the compiler, so I guess it's a
tool for compiler developers to spot missed optimisation opportunities
in the back-end.
Not when the dead store happens in an external DSO where the compiler can't detect it (same applies for single references).
Given that you're going to store metadata for 4-64 cache lines of user
memory in a single shadow memory cache line, do you anticipate
additional slowdown related to false sharing of shadow memory in
multithreaded applications? Any plans to deal with that?
Do you mean the ones between the DSO and the instrumented code?
Because if it's just in the DSO itself, then the compiler could have
spotted it, too, when compiling the DSO.
I mean, of course there are cases like interprocedural dead-store
(call a function that changes A while changing A right after), but
that again, could be found at compilation time, given enough inlining
depth or IP analysis.
Also, if this is not something the compiler can fix, what is the point
of detecting dead-stores? For all the non-trivial cases the compiler
can't spot, most will probably arise from special situations where the
compiler is changing the code to expose the issue, and thus, the user
has little control over how to fix the underlying problem.
cheers,
--renato
Not when the dead store happens in an external DSO where the compiler can't
detect it (same applies for single references).

Do you mean the ones between the DSO and the instrumented code?
Because if it's just in the DSO itself, then the compiler could have
spotted it, too, when compiling the DSO.

I mean, of course there are cases like interprocedural dead-store
(call a function that changes A while changing A right after), but
that again, could be found at compilation time, given enough inlining
depth or IP analysis.
When I read the description, I assumed it would (mostly) be used to
detect those inter-procedural dead-stores that the compiler can't see
(without LTO, at least).
The "external DSO" case also exists, but unless the DSO is also
instrumented, you'd get lots of false-negatives (which aren't "a big
problem" with the sanitizers, but of course we want to minimize them
(you'd do it by also instrumenting the DSO)).
Also, if this is not something the compiler can fix, what is the point
of detecting dead-stores? For all the non-trivial cases the compiler
can't spot, most will probably arise from special situations where the
compiler is changing the code to expose the issue, and thus, the user
has little control over how to fix the underlying problem.
Same as the other sanitizers: The compiler can't fix, but you (the
programmer) can!
I don't think the dead-stores would mostly come from the compiler
changing code around. I think they'd most likely come from the other
example you mentioned, where you call a function which writes
somewhere, and then you write over it, with no intervening read.
If this happens a lot with a given function, maybe you want to write
to some parts of the structure conditionally.
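To make that concrete, an invented example of the shape I have in mind:

    struct Config {
      int threads;
      int verbosity;
    };

    // A library function (possibly in another DSO) fills in defaults...
    void loadDefaults(Config *c) {
      c->threads = 1;
      c->verbosity = 0;
    }

    void setUp(Config &c) {
      loadDefaults(&c);
      // ...and the caller immediately overrides a field with no intervening
      // read: the store to c.threads in loadDefaults is a dead store that
      // neither function can see on its own (without inlining or LTO).
      c.threads = 8;
    }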
Derek, Qin:
Since this is mostly being researched as it is being implemented (and
in the public repo), how do you plan to coordinate with the rest of
the community? (Current status, what's "left" to get a "useful"
implementation, etc)
About the working set tool:
How are you thinking about doing the snapshots? How do you plan to
sync the several threads?
Spawning an external process/"thread" (kind of like LSan), or internally?
About the tools in general:
Do you expect any of the currently planned ones to be intrusive, and
require the user to change their code before they can use the tool
with good results?
Thank you,
Filipe
Indeed.
cheers,
--renato
Derek, Qin:
Since this is mostly being researched as it is being implemented (and
in the public repo), how do you plan to coordinate with the rest of
the community? (Current status, what's "left" to get a "useful"
implementation, etc)
Mostly via llvm-dev. Do you have another suggestion?
About the working set tool:
How are you thinking about doing the snapshots? How do you plan to
sync the several threads?
Spawning an external process/"thread" (kind of like LSan), or internally?
As requested I sent a separate RFC about the working set tool. Let's
discuss on that thread.
About the tools in general:
Do you expect any of the currently planned ones to be intrusive, and
require the user to change their code before they can use the tool
with good results?
No, we do not want to rely on annotations or anything like that.