[XRay] Build instrumented Clang, some analysis results

Hi everyone,

TL;DR: With current pending patches applied in compiler-rt and llvm, and trunk clang, you can build your application with XRay tracing enabled on Linux with tracing enabled before main starts, and logging stops when the main thread exits.

Just a quick update, I have some patches under review that when applied cleanly to LLVM and compiler-rt allows for building applications with XRay tracing enabled on Linux binaries from start to end. The list of patches to apply are:

  * ⚙ D21987 [XRay] Implement `llvm-xray extract`, start of the llvm-xray tool (accounting tool, lives in llvm)
  * https://reviews.llvm.org/D21612 then https://reviews.llvm.org/D21982 (compiler-rt XRay runtime implementation)

To test this out, I built clang with XRay and ran it on a hello world application. Here's how I did it:

1) Build clang+compiler-rt with the above patches patched in. Follow the normal instructions for getting a version of clang (from trunk) built. I use CMake+Ninja to do it, and I build everything. Example session:

# assume ./llvm is set up appropriately
$ cd llvm-build
$ cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ../llvm
$ ninja
$ cd ..

2) Build clang again, preferably in another build directory where you can use the clang and compiler-rt built in #1, this time with the following additional flags for CMAKE_CXX_FLAGS: `-fxray-instrument` `-fxray-instruction-threshold=1`. The second flag tells clang+llvm to instrument functions that have at least one instruction in it. An example session of this would look something like:

$ mkdir xray-llvm-build
$ cd xray-llvm-build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=../llvm-build/bin/clang++ -DCMAKE_C_COMPILER=../llvm-build/bin/clang -DCMAKE_CXX_FLAGS='-fxray-instrument -fxray-instruction-threshold=1 -g' -DBUILD_SHARED_LIBS=Off ../llvm
$ ninja clang
$ cd ..

3) Run clang on a sample program, configure XRAY_OPTIONS while doing so for verbosity control. In my case I have the following "Hello, World!" program compiled by clang:

---->8 test.cc 8<----
#include <iostream>

int main(int argc, char* argv) {
  std::cout << "I'm being compiled with XRay!" << std::endl;
}
---->8 test.cc 8<----

$ XRAY_OPTIONS='verbosity=1' xray-llvm-build/bin/clang -o test.bin -O3 -x c++ -stdlib=libstdc++ test.cc

You should then see two lines similar to the following printed:

XRay: Log file in 'xray-log.DqqGQk'

4) Once you have one of those XRay log files, we can then use the 'xray-fc-account' tool built in the llvm-build/bin directory (in step 1). Warning: there's a lot of output in this step, I suggest piping to files. Example session:

$ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k >sample.txt 2>errs.txt

We can even output it to CSV, which allows us to load the output in a spreadsheet application where we can start sorting and analysing the data as we see fit, by adding the "-format" flag to the above command:

$ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k -format=csv >sample.csv 2>errs.txt

Analysis Results

Some general questions about X-Ray

Some general questions about X-Ray
-------------
Is there a plan to make a separate mailing list or project around
this? Do you have a list of planned features?

Interesting question -- so far we haven't decided yet whether XRay will live as another project. I'm certainly open to this possibility. No concrete plans yet. It's an open question in the original RFC too (http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).

There's a white paper that details what we plan to implement out in the open (XRay: A Function Call Tracing System – Google Research). We're still working our way to getting to a full version as described in that white paper (basically blocked by my lack of familiarity with the LLVM codebase, and other n00b-y things :D).

There's not a concrete list of features, and we're certainly open to contributions from the community to add features that make sense. :slight_smile:

Graphics tools for analysis? AMD open sourced their CodeAnalyst - What
about some integration with that?

Thanks for the pointer! Yes, I'd love to have more integration with existing visualisation tools that read a particular well-documented format. Others have mentioned Jumpshot which might be a little dated, but still something that some people use for similar things.

Linux + Perf support (planned/exists)?

There are no plans to support the perf counter-lookups yet. Although I certainly think that's a nice source of data to be logging XRay-style.

FWIW, the API for XRay allows us to decouple the things being logged at function entry/exit. Getting performance counters at those points is a nice idea, it should be doable.

How much is this tied to something specific about Linux or it could be
easily ported to another platform?

Currently, the only Linux-specific part I can remember is getting the cpu frequency (looking at sysfs files). That can be implemented on a platform-agnostic (or at least pluggable and portable) manner.

There are x86'isms and I'm working on understanding how to do this in Aarch64 or ARM.

What's the benefit of this vs a stable and production ready tool like Dtrace?

I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:

- Dtrace requires kernel-side support.
- XRay is completely in-process and controllable by the process through an API (not sure if dtrace is the same).
- XRay is selective and configurable by the application developer.
- XRay's cost is borne by the application only, and does not require stopping the application.

How much overhead do you typically measure?

We've seen in the "null logging case" something around O(100) cycles in X86 for the trampoline side of XRay. Of course richer logging requires more cycles, and is completely implementation-dependent. The current one under development only writes fixed-sized records, uses __rdtscp(), does aligned writes only, and flushes when the buffer is full. The buffer is 32k per thread.

I haven't formally done benchmarks on the current implementation yet, but I'd be happy to do that soon.

If you're injection calls before/after every function - does it end up
blocking optimizations? Without looking at the implementation, if
you're injecting the calls late enough in the compilation process it
won't be a "problem", but if it's too early - you're going to end up
blocking a lot of optimizations and interfering with things a lot..

It's currently implemented as a MachineFunctionPass, and as far as I can tell is already late enough in the process that we are not interfering with optimisations.

Cheers

Some general questions about X-Ray
-------------
Is there a plan to make a separate mailing list or project around
this? Do you have a list of planned features?

Interesting question -- so far we haven't decided yet whether XRay will live as another project. I'm certainly open to this possibility. No concrete plans yet. It's an open question in the original RFC too (http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).

There's a white paper that details what we plan to implement out in the open (XRay: A Function Call Tracing System – Google Research). We're still working our way to getting to a full version as described in that white paper (basically blocked by my lack of familiarity with the LLVM codebase, and other n00b-y things :D).

There's not a concrete list of features, and we're certainly open to contributions from the community to add features that make sense. :slight_smile:

Graphics tools for analysis? AMD open sourced their CodeAnalyst - What
about some integration with that?

Thanks for the pointer! Yes, I'd love to have more integration with existing visualisation tools that read a particular well-documented format. Others have mentioned Jumpshot which might be a little dated, but still something that some people use for similar things.

Linux + Perf support (planned/exists)?

There are no plans to support the perf counter-lookups yet. Although I certainly think that's a nice source of data to be logging XRay-style.

FWIW, the API for XRay allows us to decouple the things being logged at function entry/exit. Getting performance counters at those points is a nice idea, it should be doable.

How much is this tied to something specific about Linux or it could be
easily ported to another platform?

Currently, the only Linux-specific part I can remember is getting the cpu frequency (looking at sysfs files). That can be implemented on a platform-agnostic (or at least pluggable and portable) manner.

There are x86'isms and I'm working on understanding how to do this in Aarch64 or ARM.

Ack - actually x86 probably makes some of this a lot easier. I'm
recently (frequently) annoyed (as hell) with how AArch64 isn't
exposing a bunch of basic things that I *want* (demand!) to know.

For example:
clock frequency on AArch64 + Linux == forget it. I had to use a
benchmark in order to basically brute force calculate some processors.
(They don't hard code it in /proc/cpuinfo or sys as you'd want) and
I'm really uncertain about what happens if there's stepping involved
(current AArch64 processors that I'm aware of don't have this feature
though) /* Maybe Google can help kick the linux devs into accepting
these patches */

For FBSD and iOS - I don't know how/if they expose this information..
(Is FBSD ported to AArch64 yet.. ?)

What's the benefit of this vs a stable and production ready tool like Dtrace?

I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:

- Dtrace requires kernel-side support.
- XRay is completely in-process and controllable by the process through an API (not sure if dtrace is the same).
- XRay is selective and configurable by the application developer.
- XRay's cost is borne by the application only, and does not require stopping the application.

Just as you're instrumenting around functions - DTrace can similarly
inject "probes" (basically the same thing) - The other more common way
for DTrace to be used is for the application to not be changed and
it's just profiled. (Ok you must leave SP otherwise it won't work.. so
for the purist I guess you're relying on applications not to be
/fully/ optimized.. I forget if DWARF or CFW is required, but I don't
think so )

Dtrace also doesn't require stopping the application fwiw and you can
control probably a lot more of what's probed/instrumented. (There's a
full scripting langauge in order to control what you instrument
actually)

I'm not trying to take away from X-Ray, I think profiling is extremely
important, but I'm just wondering how much (if any) evaluation of
existing solutions was done. Maybe the DTrace licensing, CTF dep or
linux support was a dealbreaker, I just hate to see NIH when there's
good tools available that cover a significant amount of the needs or
more.

At the end of the day it's probably quite complementary, like all the
work you do for X-Ray, someone could likely leverage to automatically
inject DTrace probes and get a lot of the same stuff.

In your response you mentioned "O(100) cycles" - is that 100
instructions of "skid" between point of measurement? (Seems really
high for instrumenting, but maybe I'm mistaken..)

Lastly and again just side comments - in terms of data formats - JSON
has pretty good support, streams and compresses nicely, high
performance parses exist with liberal licensing as well as I think
there's a binary version of it. This could be handy *if* your app is
on Node B and you'd like the logs to be sent to Node A.

Anywho - cool work..

How much is this tied to something specific about Linux or it could be
easily ported to another platform?

Currently, the only Linux-specific part I can remember is getting the cpu frequency (looking at sysfs files). That can be implemented on a platform-agnostic (or at least pluggable and portable) manner.

There are x86'isms and I'm working on understanding how to do this in Aarch64 or ARM.

Ack - actually x86 probably makes some of this a lot easier. I'm
recently (frequently) annoyed (as hell) with how AArch64 isn't
exposing a bunch of basic things that I *want* (demand!) to know.

For example:
clock frequency on AArch64 + Linux == forget it. I had to use a
benchmark in order to basically brute force calculate some processors.
(They don't hard code it in /proc/cpuinfo or sys as you'd want) and
I'm really uncertain about what happens if there's stepping involved
(current AArch64 processors that I'm aware of don't have this feature
though) /* Maybe Google can help kick the linux devs into accepting
these patches */

I actually haven't gotten that far yet, to be honest -- I was still just trying to learn how to do the runtime patching and the instrumentation sleds faster. :slight_smile:

But it is good to know what other kinds of things I might run into when we cross that bridge. :smiley:

For FBSD and iOS - I don't know how/if they expose this information..
(Is FBSD ported to AArch64 yet.. ?)

I have no idea about FreeBSD. :confused:

What's the benefit of this vs a stable and production ready tool like Dtrace?

I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:

- Dtrace requires kernel-side support.
- XRay is completely in-process and controllable by the process through an API (not sure if dtrace is the same).
- XRay is selective and configurable by the application developer.
- XRay's cost is borne by the application only, and does not require stopping the application.

Just as you're instrumenting around functions - DTrace can similarly
inject "probes" (basically the same thing) - The other more common way
for DTrace to be used is for the application to not be changed and
it's just profiled. (Ok you must leave SP otherwise it won't work.. so
for the purist I guess you're relying on applications not to be
/fully/ optimized.. I forget if DWARF or CFW is required, but I don't
think so )

XRay doesn't rely on DWARF, and has a separate section for the instrumentation maps. That section can also be removed from the final binary and loaded externally (that feature isn't implemented yet, but I'm working on making that happen). XRay also works even if the frame pointer is omitted which is a nice property. :slight_smile:

Dtrace also doesn't require stopping the application fwiw and you can
control probably a lot more of what's probed/instrumented. (There's a
full scripting langauge in order to control what you instrument
actually)

I'm aware that Dtrace can do what XRay does and more. I'm not so sure about the technical details of some of how it does its thing -- for example, XRay isn't sampling anything and instead is made to be logging stuff for offline reconstruction/analysis.

I'm not trying to take away from X-Ray, I think profiling is extremely
important, but I'm just wondering how much (if any) evaluation of
existing solutions was done.

That's fair -- XRay was developed at Google a long time ago, when Dtrace wasn't available. Our internal implementation has a lot of... internal'isms which integrates well into our... internal stuff. :slight_smile:

The landscape has changed though considerably since XRay was developed and when we decided to open-source an implementation of it. For example, clang wasn't even on the radar when some of the work on XRay started happening. Certainly there's lots of ways of doing this now, but the target at least for XRay is so that we can:

- Widen the set of platforms where we can use it. Linux+x86 is the "feature parity" point for us at least. And we're certainly interested in a lot more platforms now.

- Have better hooks into how much more efficiently we can make it. LLVM IR and the optimisation pass and analysis infrastructure gives us much more leeway into being smarter about certain instrumentation decisions.

- Make it more useful than just what our use-case has been. For example, we do performance analysis on long-running servers and want to be able to do instrumentation for only a certain period of time (not during the lifetime of the application). The logging implementation we have internally (that we're bringing out in the open) has a lot of cleverness to get the tradeoff between cost and coverage "just right" for our use-cases. There are other cases where this makes sense too and we recognise that being able to get a full execution trace (not sampled traces, not sampled profiles) for easier performance debugging of things like compilers, command-line tools, and other classes of applications does make sense too.

Maybe the DTrace licensing, CTF dep or
linux support was a dealbreaker, I just hate to see NIH when there's
good tools available that cover a significant amount of the needs or
more.

There are a couple of other things -- like the cost of when instrumentation is on. Dtrace currently requires a mode switch when the probe is encountered which is a non-trivial cost for the kinds of applications we've been debugging. Certainly that and the potential of affecting other applications/systems while Dtrace is enabled (and frankly the kinds of things you *can* do with it) becomes very hard to use at least for some of our systems where XRay has been used in terms of debugging.

/me is also not a fan of NIH. :slight_smile:

At the end of the day it's probably quite complementary, like all the
work you do for X-Ray, someone could likely leverage to automatically
inject DTrace probes and get a lot of the same stuff.

Agreed. Also consider the non-Linux systems, and those that have stricter requirements on resource consumption, etc. :slight_smile:

In your response you mentioned "O(100) cycles" - is that 100
instructions of "skid" between point of measurement? (Seems really
high for instrumenting, but maybe I'm mistaken..)

That's CPU cycles, and mostly the following:

For entry points:
- calling into a trampoline (relative jump)
- saving register states
- checking a global value if it's null (the logging intercept function pointer)
- loading register states
- returning

For exit points:
- jumping into a trampoline (relative jump)
- saving a couple of registers
- checking a global value if it's null (the logging intercept function pointer)
- loading register states
- returning

Now the logging intercept function should be tuned to do as little as possible. The implementation in the patches I mentioned to compiler-rt uses thread_local buffers and attempts to do "as little as possible" to write out the fixed sized log entries. Basically the cost of getting TSC and some stores.

Lastly and again just side comments - in terms of data formats - JSON
has pretty good support, streams and compresses nicely, high
performance parses exist with liberal licensing as well as I think
there's a binary version of it. This could be handy *if* your app is
on Node B and you'd like the logs to be sent to Node A.

Yeah, JSON is definitely one potential format. I've been meaning to use the Chrome profile viewer too. The problem has been the amount of data we're talking about here -- with 32-byte fixed-sized records currently per entry/exit, we're already at 606MB for a fully instrumented clang compiling a simple hello-world program (compressed is 81MB). I haven't tried writing this out in JSON, but I suspect that's multiples of the fixed-sized records. :smiley:

We can probably optimise that further with the stack de-duping support in the JSON format, but that's still a lot of segments/events, even if it could be converted to JSON. :slight_smile:

Anywho - cool work..

Thanks! :slight_smile: