RFC: Comprehensive Static Instrumentation

Hey LLVM-dev,

We propose to build the CSI framework to provide a comprehensive suite of compiler-inserted instrumentation hooks that dynamic-analysis tools can use to observe and investigate program runtime behavior. Traditionally, tools based on compiler instrumentation would each separately modify the compiler to insert their own instrumentation. In contrast, CSI inserts a standard collection of instrumentation hooks into the program-under-test. Each CSI-tool is implemented as a library that defines relevant hooks, and the remaining hooks are “nulled” out and elided during link-time optimization (LTO), resulting in instrumented runtimes on par with custom instrumentation. CSI allows many compiler-based tools to be written as simple libraries without modifying the compiler, greatly lowering the bar for developing dynamic-analysis tools.

CC’ing the mailing list for the CSI project.

I am very glad this project has reached the state where we can start the public code review. Please shoot the patches for review when ready.

–kcc

Some of this overlaps with the features in XRay (http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).

– Matthias


================
Motivation

Key to understanding and improving the behavior of any system is visibility: the ability to know what is going on inside the system. Various dynamic-analysis tools, such as race detectors, memory checkers, cache simulators, call-graph generators, code-coverage analyzers, and performance profilers, rely on compiler instrumentation to gain visibility into program behavior during execution. With this approach, the tool writer modifies the compiler to insert instrumentation code into the program-under-test so that it can execute behind the scenes while the program-under-test runs. This approach, however, means that the development of new tools requires compiler work, which many potential tool writers are ill-equipped to do, and thus raises the bar for building new and innovative tools.

The goal of the CSI framework is to provide comprehensive static instrumentation through the compiler, in order to simplify the task of building efficient and effective platform-independent tools. The CSI framework allows the tool writer to easily develop analysis tools that require compiler instrumentation without needing to understand compiler internals or modify the compiler, which greatly lowers the bar for developing dynamic-analysis tools.

================
Approach

The CSI framework inserts instrumentation hooks at salient locations throughout the compiled code of a program-under-test, such as function entry and exit points, basic-block entry and exit points, before and after each memory operation, etc. Tool writers can instrument a program-under-test simply by first writing a library that defines the semantics of relevant hooks
and then statically linking their compiled library with the program-under-test.

At first glance, this brute-force method of inserting hooks at every salient location in the program-under-test seems replete with overheads. CSI overcomes these overheads through the use of link-time optimization (LTO), which is now readily available in most major compilers, including GCC and LLVM. Using LTO, instrumentation hooks that are not used by a particular tool can be elided, allowing the overheads of those hooks to be avoided when the instrumented program-under-test is run. Furthermore, LTO can optimize a tool’s instrumentation within a program using traditional compiler optimizations. Our initial study indicates that the use of LTO does not unduly slow down the build time.

I don’t understand this flow: the front-end emits all the possible instrumentation, but the useless calls to the runtime will be removed during the link?
It means that the final binary is specialized for a given tool, right? What is the advantage of generating this useless instrumentation in the first place then? I’m missing a piece here…

This is a false claim: LTO has a very large overhead, and in particular it is not parallel, so the more cores you have, the larger the difference will be. We frequently observe builds that are 3 times slower. Moreover, LTO is not incremental-friendly, and during debugging (which is very relevant with sanitizers) rebuilding means waiting for the full link to occur again.

Can you clarify the scope of tools that you want to develop? Are these profiling tools, security-enforcement tools, debugging tools, etc.? The type of tools you want to build will dictate whether such a framework makes sense.

The algorithms for optimizing away run-time hooks are not necessarily uniform. For example, if a tool instruments loads and stores to collect the set of memory locations accessed by a program, then optimizing away a redundant check on a store is okay. If the instrumentation is meant to enforce memory safety, then redundant checks can only be optimized away if there is no intervening call to free() between the two checks (which may require inter-procedural analysis to determine). In such a case, you either need to be very pessimistic about the optimizations that you use, or you will get incorrect optimizations for certain classes of dynamic analyses.

Is your intention to have a compiler flag that enables insertion of hooks, or are you planning on having a pass always add the hooks and having libLTO always remove them? I assume the former, but you should probably clarify. Don’t forget that atomic instructions (e.g., compare-and-swap) and some of the intrinsics (e.g., llvm.memcpy()) also access memory.

As an aside, I’m not sure that I buy the idea that tool developers should be oblivious to the compiler internals. If a tool developer doesn’t understand what the compiler is doing, then she/he may not understand the results of the output. For example, LLVM loads/stores do not include stack spill slots, memory accesses incurred by calling conventions, etc. Depending on where the instrumentation passes are placed in the pass pipeline, instrumentation calls can be moved or removed (perhaps in undesirable ways for some dynamic-analysis applications). I would also argue that a key design feature of LLVM is to make writing such passes simple, and I think LLVM accomplishes this. If one understands how to build an efficient dynamic analysis, one can probably handle writing the compiler passes.

Overall, I think having common instrumentation infrastructure is a good thing. However, I’m not sure how common it can be across different applications of instrumentation. As an example, most memory-safety solutions have the same instrumentation and optimization needs (and constraints on optimization) regardless of how they implement the checks on loads and stores. However, it’s less clear to me that a race detector and a memory-safety compiler can safely use the same optimizations. You may find yourself implementing a common infrastructure with each tool implementing specialized optimizations to make each type of dynamic analysis really fast.

Food for thought,
John Criswell

We’ve just released the project code for public review. You can find the diffs at the following links:

CSI LLVM pass: http://reviews.llvm.org/D21445
CSI Clang support: http://reviews.llvm.org/D21446
CSI Runtime and tests: http://reviews.llvm.org/D21447

The RST documentation for the CSI project is included with the Clang diff.

Hey Mehdi,

Thank you for your comments. I’ve CC’d the CSI mailing list with your comments and put my responses inline. Please let me know any other questions you have.

Cheers,
TB

Hi TB,

Thanks for your answer.


Ok, this is roughly what I had in mind.

I still believe it is not great to rely on LTO, and better yet, it is not needed to achieve this result.

For instance, I don’t see why the “library” that defines the subset of instrumentation hooks used by the tool can’t be fed in during a regular compile, with the useless hooks eliminated at that point.
As an implementation detail, in practice, instead of feeding in the library itself, the “framework” that generates the library for the tool writer could output a “configuration file” alongside the library, and this configuration file is what is fed to the compiler to tell the instrumentation pass which of the hooks to generate. That sounds more efficient to me, and it removes the dependency on LTO.
I imagine there is a possible drawback that I’m missing right now…
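For illustration only, such a configuration file might look like the following; the format and the hook names are entirely hypothetical, invented here to make the proposal concrete:

```
# Hypothetical hook-configuration file, emitted by the tool framework
# alongside the tool library and read by the instrumentation pass.
# Only hooks marked "on" are generated at compile time; no LTO needed.
func_entry = on     # the tool defines this hook
func_exit  = off    # never inserted
bb_entry   = off
load       = on
store      = on
```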

I expect this to be reproducible on most non-trivial C/C++ programs.
But taking clang as an example, just running ninja clang on OS X on a not-so-recent 12-core machine takes 970s with LTO and 252s without (and I believe this is without debug info…).
Running just ninja to build all of llvm/clang here would take a lot longer with LTO, and not so much without.

(These LTO builds are without asserts.)

Best,

Hey John,

Thanks for your comments. I’ve CC’d the CSI mailing list with your comments and responded inline. Please let me know if you have further questions.

Cheers,
TB


I don't understand this flow: the front-end emits all the possible
instrumentation but the useless calls to the runtime will be removed during
the link?
It means that the final binary is specialized for a given tool right?
What is the advantage of generating this useless instrumentation in the
first place then? I'm missing a piece here...

Here's the idea. When a tool user, who has a program they want to
instrument, compiles their program source into an object/bitcode, he can
turn on the CSI compile-time pass to insert instrumentation hooks (function
calls to instrumentation routines) throughout the IR of the program.
Separately, a tool writer implements a particular tool by writing a library
that defines the subset of instrumentation hooks she cares about. At link
time, the object/bitcode of the program source is linked with the object
file/bitcode of the tool, resulting in a tool-instrumented executable.
When LTO is used at link time, unused instrumentation is elided, and
additional optimizations can run on the instrumented program. (I'm happy
to send you a nice picture that we have of this flow, if the mailing list
doesn't mind.)


I agree that the tool does not need to depend on full LTO. What is needed is essentially an option or configuration such that the compiler can find the bitcode file(s) for the hooks at compilation time. It is pretty much similar to how math-function inlining can be done ...

David

Matthias beat me to it!

From reading the RFC, it seems that some of what XRay is doing on the instrumentation side is very similar to what CSI enables. The current implementation I’m working with (in http://reviews.llvm.org/D19904) requires some very deep integration with the LLVM compiler infrastructure (essentially a machine function pass, and instruction lowering on a per-platform basis).

The development of XRay has a few trade-offs for code-size effect and runtime overhead, which I suspect are unique to XRay’s target use case, which is efficient function-call accounting/tracing.

So I have two high-level questions:

  • Will something like XRay be implementable on-top of CSI?
  • How will the runtime components cooperate or be handled by tools built on top of CSI?

Cheers

Suppose I want to build a production build, and one build for each of ASAN, MSAN, UBSAN, and TSAN.

With the current approach, I need to compile my source five different times, and link five different times.

With the CSI approach (assuming it was the backing technology behind the sanitizers), I need to compile twice (once for production, once for instrumentation), then LTO-link five times. I can reuse my .o files across the sanitizer types.

It’s possible that the math doesn’t really work out in practice if the cost of the LTO-link dwarfs the compile times.
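That trade-off can be written out explicitly. The symbols below are ours, not from the thread: let $t_c$ be the cost of one full compile, $t_l$ an ordinary link, and $t_L$ an LTO link.

```latex
% Five traditional sanitizer builds vs. the CSI scheme above:
T_{\mathrm{traditional}} = 5\,(t_c + t_l), \qquad
T_{\mathrm{CSI}} = 2\,t_c + 5\,t_L
% CSI is a net win when T_CSI < T_traditional, i.e. when
5\,(t_L - t_l) < 3\,t_c
```

So the LTO link may cost at most three-fifths of a compile more than a plain link before the .o-reuse advantage disappears, which is exactly the “LTO-link dwarfs the compile times” concern.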


Other than the build time, we should also consider the performance of the produced binary, which might be more important.
I have a hard time believing that the LTO-link-optimized (CSI version 1) binary could beat the original ASan IR-instrumentation-based binary.
With IR instrumentation, the binary benefits both from problem-specific domain knowledge and from comprehensive compiler optimizations: e.g., inlining small code without a context switch, skipping redundant load/store instrumentation, and more aggressive optimization since the compiler sees everything.
I am not sure if CSI could do any of these.
IMHO, CSI might be good for fast prototype research work but may fall short when we are really serious about performance.


I'd expect CSI to work with ThinLTO (which is still undocumented, sadly) out of the box.
Empty hooks will be eliminated completely, trivial hooks will be inlined. Non-trivial ones will stay.

I would also argue that a key design feature of LLVM is to make writing such passes simple, and I think LLVM accomplishes this. If one understands how to build an efficient dynamic analysis, one can probably handle writing the compiler passes.
John Criswell

I disagree here.
LLVM *is* the simplest-to-use compiler infrastructure I've used, but not having to deal with compiler guts for prototyping ideas is still an order of magnitude simpler.
For most of the CSI use cases the users will not even need to understand anything about compilers (except for the concept of basic blocks).

And we should not expect CSI to be as fast and as specializable as a custom pass.
CSI's power is in the ability to attract more people to do quick-and-easy experiments.

--kcc

yes, it is pretty close.

David

Hey Ben,

Thank you for your comments. I’ve put my response inline.

Cheers,
TB


It is a very artificial advantage: what are you saving? Temporary disk space?

These results suggest that “adding instrumentation has a cost,” nothing more, and are unrelated to LTO. You would get the same results by providing the runtime to the compiler directly during the compile phase.

LTO basically means “compiles during the link”. You won’t save much.

I haven’t seen a single compelling argument in this thread so far for tying CSI to LTO.

Mehdi, I think TB is saying that LTO is actually *more* incremental than standard compilation for people writing instrumentation tools.

It's far simpler to rebuild your CSI instrumentation hooks and re-run LTO on Apache than it is to re-run someone else's complicated build system. In fact, you save all the time that clang would spend parsing C and C++ source code, since you're starting with several blobs of semi-optimized LLVM IR.

It's worth exploring what you mentioned up-thread, where you inline the instrumentation into the code during normal compilation instead of waiting until link time, but I think it makes more sense to make it easy to build tools before we try to make those new tools compile faster. It seems like a nice-to-have feature for instrumentation tools that manage to graduate from research idea to production tool.