RFC: XRay Profiling in LLVM

Objective

Implement an XRay mode ‘xray-profiling’ that gathers stack trace latencies/durations and builds histograms to provide basic statistics about where time is going in an execution of the application.

Background

XRay has two modes currently implemented in compiler-rt: a basic (nee naive) mode and flight data recorder (FDR) mode.

Basic mode logging, when enabled, will collect traces and write verbose logs to disk. This mode has some unsophisticated filtering and stack-depth limiting support. This mode is useful for tracing short-lived applications, so that execution can be traced throughout the lifetime of the application. We’ve used basic mode logging to identify sources of latency in Clang/LLVM.

FDR mode logging is designed to record traces and aggressively filter out events that fall below a latency threshold. This mode uses a fixed amount of heap space to record events as they’re happening, and only writes out the logs on demand. FDR mode is used mostly for long-running applications and services.

We also already have tools in LLVM that can turn these XRay traces into other forms that are easier to digest. One tool turns XRay traces into something that the Chrome Trace Viewer can render [1] and another summarises it to generate latency flame graphs [2].

Problem Statement

Both basic and FDR mode logging can create massive traces. One example, where a fully-instrumented clang binary was traced with XRay’s basic mode, we found that it could generate an 89GB trace compiling the canonical “Hello, World!” program.

One way to address this problem is to limit the amount of memory being used by the implementation through flags. Unfortunately that also means basic mode XRay will block more on writing out the log files.

Filtering at runtime also takes some CPU time, so the overheads of writing out the log as well as filtering the records become non-trivial for heavily instrumented programs.

Proposed Solution

We’d like to propose a profiling mode to XRay, called ‘xray-profiling’. The intent is to implement handlers that will generate profile data [3] that other tools can digest. We’ll export the data in an XRay-specific format for a more compact representation. We intend to implement conversion tools to support the profile.proto format that pprof can handle as input.

To do this, we need to record the “stack” of functions that are XRay instrumented, represented as a prefix tree (or forest, as we’ll see later). This prefix tree will represent the XRay instrumented functions that have been entered, and then exited.

A() → B() → C()

→ C()

Each node in the tree will contain a histogram of the latencies/durations of the function by subtracting the timestamp at the time of exit from the timestamp at entry.

When the tracing functionality is finalised and flushed, we end up with a data file that represents the prefix tree in a summarised form following the profile.proto format.

Logistics

As with the Basic and FDR modes, the profiling mode will be implemented in compiler-rt and will be default-linked into any XRay instrumented binary.

In the future we can provide more targeted archives that can be controlled in clang, such that users can determine which runtime implementations they’d like to be linked into the final binary. This work may be done in parallel, but is unrelated to the work involved with the XRay profiling mode implementation.

Call for Reviewers

One of the issues we’re finding with the XRay work is there’s very little community knowledge about what’s happening in the project. I would like to change that by encouraging more reviewers, interested in XRay, to please sign up to be a reviewer. Eventually, we’d like to grow the list of people contributing to XRay from the community and this is an opportunity to do so.

Open Questions

  • The triggering mechanism is still through programmatically turning on/off XRay instrumentation in an XRay-instrumented binary. We’re exploring some other options based on ptrace (will send another RFC at some point about this) and a tool that will explicitly control the XRay tracing externally. The alternative to this is having a dedicated XRay thread that will periodically check whether a signal has been tripped, and do the standardised procedure for enabling/disabling/flushing trace data. Any alternatives to this that we may not have considered yet?

Thanks in advance!

References

[0] https://github.com/google/pprof
[1] http://lists.llvm.org/pipermail/llvm-dev/2017-December/119529.html
[2] https://llvm.org/docs/XRayExample.html#flame-graph-generation
[3] https://github.com/google/pprof/blob/master/proto/profile.proto

FYI: Patch is now available for review in https://reviews.llvm.org/D44620.

As an additional update, this is now a series of patches.

If you're interested in learning more about XRay in compiler-rt, please
feel free to dive in and review the code. Any insights/questions/feedback
would be most appreciated!

https://reviews.llvm.org/D45757 - Part 2: Function Call Trie
https://reviews.llvm.org/D45758 - Part 3: Profile Collector Service
https://reviews.llvm.org/D44620 - Part 4: Profiler Mode Wiring
https://reviews.llvm.org/D45998 - Part 5: Profile File Writing

Cheers

FYI: Patch is now available for review in https://reviews.llvm.org/D44620.

# Objective

Implement an XRay mode 'xray-profiling' that gathers stack trace

latencies/durations and builds histograms to provide basic statistics about
where time is going in an execution of the application.

# Background

XRay has two modes currently implemented in compiler-rt: a basic (nee

naive) mode and flight data recorder (FDR) mode.

Basic mode logging, when enabled, will collect traces and write verbose

logs to disk. This mode has some unsophisticated filtering and stack-depth
limiting support. This mode is useful for tracing short-lived applications,
so that execution can be traced throughout the lifetime of the application.
We've used basic mode logging to identify sources of latency in Clang/LLVM.

FDR mode logging is designed to record traces and aggressively filter

out events that fall below a latency threshold. This mode uses a fixed
amount of heap space to record events as they're happening, and only writes
out the logs on demand. FDR mode is used mostly for long-running
applications and services.

We also already have tools in LLVM that can turn these XRay traces into

other forms that are easier to digest. One tool turns XRay traces into
something that the Chrome Trace Viewer can render [1] and another
summarises it to generate latency flame graphs [2].

# Problem Statement

Both basic and FDR mode logging can create massive traces. One example,

where a fully-instrumented clang binary was traced with XRay's basic mode,
we found that it could generate an 89GB trace compiling the canonical
"Hello, World!" program.

One way to address this problem is to limit the amount of memory being

used by the implementation through flags. Unfortunately that also means
basic mode XRay will block more on writing out the log files.

Filtering at runtime also takes some CPU time, so the overheads of

writing out the log as well as filtering the records become non-trivial for
heavily instrumented programs.

# Proposed Solution

We'd like to propose a profiling mode to XRay, called 'xray-profiling'.

The intent is to implement handlers that will generate profile data [3]
that other tools can digest. We'll export the data in an XRay-specific
format for a more compact representation. We intend to implement conversion
tools to support the profile.proto format that pprof can handle as input.

To do this, we need to record the "stack" of functions that are XRay

instrumented, represented as a prefix tree (or forest, as we'll see later).
This prefix tree will represent the XRay instrumented functions that have
been entered, and then exited.

A() -> B() -> C()
     \
       -> C()

Each node in the tree will contain a histogram of the

latencies/durations of the function by subtracting the timestamp at the
time of exit from the timestamp at entry.

When the tracing functionality is finalised and flushed, we end up with

a data file that represents the prefix tree in a summarised form following
the profile.proto format.

# Logistics

As with the Basic and FDR modes, the profiling mode will be implemented

in compiler-rt and will be default-linked into any XRay instrumented binary.

In the future we can provide more targeted archives that can be

controlled in clang, such that users can determine which runtime
implementations they'd like to be linked into the final binary. This work
may be done in parallel, but is unrelated to the work involved with the
XRay profiling mode implementation.

# Call for Reviewers

One of the issues we're finding with the XRay work is there's very

little community knowledge about what's happening in the project. I would
like to change that by encouraging more reviewers, interested in XRay, to
please sign up to be a reviewer. Eventually, we'd like to grow the list of
people contributing to XRay from the community and this is an opportunity
to do so.

# Open Questions

* The triggering mechanism is still through programmatically turning

on/off XRay instrumentation in an XRay-instrumented binary. We're exploring
some other options based on ptrace (will send another RFC at some point
about this) and a tool that will explicitly control the XRay tracing
externally. The alternative to this is having a dedicated XRay thread that
will periodically check whether a signal has been tripped, and do the
standardised procedure for enabling/disabling/flushing trace data. Any
alternatives to this that we may not have considered yet?