[XRay] RFC: LLVM-side Changes for nop-sleds

Hi llvm-dev (cc google-xray),

As a follow-up to the first XRay RFC [0] introducing the technology, I’ve recently been able to implement a functional prototype of the major parts of the XRay functionality [1]. This RFC is limited to exploring potential alternatives to the current LLVM-side changes, with the aim of getting clear guidance on landing those changes in LLVM first.

Background / Current Implementation

I have a few meta questions here.

Why should LLVM (and from the patch it seems Clang) favor one
instrumentation system -- in this case the XRay instrumentation system
vs. many others that may be possible to add to upstream?

It seems GCC has -finstrument-functions that call into cyg_....
functions. Poor naming choice, but I suppose one thing would be to use
those names. Or better yet, provide a way in commandline to say what
functions are for entry, and what are for exit.

How is this different from hot patching that exists in Windows? I
suppose this feature makes it more accessible?

I hope we can change the name of this thing if it were to be added to
something generic that doesn't tie us to the runtime libraries needed
for XRay specifically.

Thanks for the questions Hayden, please see in-line below some responses.

I have a few meta questions here.

Why should LLVM (and from the patch it seems Clang) favor one
instrumentation system – in this case the XRay instrumentation system
vs. many others that may be possible to add to upstream?

I don’t think there’s any intent to exclude any existing or alternative instrumentation systems from LLVM. At least with our proposal, we’re making sure we play well with the implementations already in LLVM/Clang and with others that might come along.

It seems GCC has -finstrument-functions that call into cyg_…
functions. Poor naming choice, but I suppose one thing would be to use
those names. Or better yet, provide a way in commandline to say what
functions are for entry, and what are for exit.

I thought Clang already supported this as an option?
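
For reference, GCC’s -finstrument-functions (which Clang also accepts) inserts calls to two hooks, __cyg_profile_func_enter and __cyg_profile_func_exit, at function entry and exit. Below is a minimal sketch of user-supplied hooks. The counter names are made up for illustration, and the hooks are only called automatically when the code is built with -finstrument-functions.

```cpp
// Hypothetical counters, for illustration only.
static unsigned long g_entries = 0;
static unsigned long g_exits = 0;

extern "C" {

// The hooks themselves must not be instrumented, or they would recurse.
// Compiler builtins are used instead of library calls for the same reason.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
  (void)this_fn;    // address of the function being entered
  (void)call_site;  // address of the call site
  __atomic_fetch_add(&g_entries, 1UL, __ATOMIC_RELAXED);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
  (void)this_fn;
  (void)call_site;
  __atomic_fetch_add(&g_exits, 1UL, __ATOMIC_RELAXED);
}

}  // extern "C"
```

Note that with this scheme every instrumented function pays for two real calls whether or not anyone is listening, which is the cost the nop-sled approach is trying to avoid.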

How is this different from hot patching that exists in Windows? I
suppose this feature makes it more accessible?

There are several differences. Some that I can list:

  • XRay aims not to change the functionality of the application/function being instrumented. The sole goal of the XRay instrumentation points is to allow instrumentation to be enabled and disabled dynamically, using only the instrumentation points that the compiler has inserted. With hot-patching in Windows, as far as I can tell, the intent is to replace the implementation of a function at runtime entirely, not just to instrument it. You could say that XRay might be implemented in a similar manner, by rewriting the function being instrumented at runtime and hot-patching the original function implementation, but we’ve chosen not to do that for efficiency reasons (a trade-off between the cost of instrumentation when “off” and when “on”).

  • XRay has a very specific goal, which is to generate function call traces for performance debugging. Other instrumentation systems will have different goals, and hot-patching is just one of the techniques useful for achieving them. We can certainly allow other uses for XRay (e.g. in the prototype implementation, we have hooks to change which function is called when an instrumentation point is encountered at runtime), but the immediate goal is generating traces that can be analysed offline.
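
To illustrate the kind of hook mentioned in the last point, here is a minimal sketch of a handler that can be swapped at runtime. All names here (EventType, Handler, set_handler, trampoline, count_events) are hypothetical and are not the prototype’s actual interface; the sketch only shows the pattern of an atomic load followed by a null check.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical names throughout; this is not the XRay runtime API.
enum class EventType : std::uint8_t { Entry, Exit };
using Handler = void (*)(std::int32_t func_id, EventType type);

// nullptr means "no handler installed".
static std::atomic<Handler> g_handler{nullptr};

// Install or replace the handler; returns the previously installed one.
inline Handler set_handler(Handler h) {
  return g_handler.exchange(h, std::memory_order_acq_rel);
}

// What a patched-in sled would jump to: load the handler atomically and
// call it only if one is installed.
inline void trampoline(std::int32_t func_id, EventType type) {
  if (Handler h = g_handler.load(std::memory_order_acquire)) {
    h(func_id, type);
  }
}

// Example handler that just counts events (not thread-safe; demo only).
static int g_event_count = 0;
inline void count_events(std::int32_t, EventType) { ++g_event_count; }
```

Calling set_handler(count_events) turns counting on and set_handler(nullptr) turns it off again, without touching the instrumented code.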

I hope we can change the name of this thing if it were to be added to
something generic that doesn’t tie us to the runtime libraries needed
for XRay specifically.

I agree we should be able to share common infrastructure in LLVM for adding instrumentation points (there was an interesting recent RFC for CSI), and I’m all for making it easier to implement things like XRay through that common infrastructure. There’s certainly been talk about consolidating the different options for adding instrumentation into a coherent set of flags in Clang, but I haven’t seen much talk about common instrumentation infrastructure support in LLVM. My hope is, if this is something the community will find useful, that we can gain consensus or at least share a clear direction. I’d be happy to do the work if that means we can get XRay functionality supported as one of the many possible implementations in LLVM.

I’m happy to have a conversation about using the work to support XRay in LLVM to make alternative instrumentation systems easier to implement, if that at least makes it clear that XRay isn’t being proposed as the “one true way” of instrumenting Clang/LLVM-built binaries. I’m even willing to iterate on the interfaces and/or implementations in LLVM so that XRay-like things can be built on top of LLVM.

As for naming, I think being able to specify the string ‘xray’ on the command line (of Clang, or some other llvm-* tools) makes it easier to search for, document, and “teach”. The vision is, if it’s possible, to have many of these instrumentation implementations live under a single flag like ‘-finstrument=’. For now, though, any talk of that might be premature if we’re only going to have ‘-finstrument=profile’ and ‘-finstrument=xray’.

Does that make sense?

Cheers

I don’t think that I’ve yet seen an explanation of why you need the NOPs. DTrace stopped using them a long time ago, for two reasons:

1) The increased code size caused a noticeable increase in i-cache misses, even when instrumentation was not actively being used. This caused a noticeable probe effect (macroscopic observable performance artefacts even when no probes are active) and caused a lot of push-back in adoption.

2) On all of the architectures where we support DTrace (currently, I believe, x86, x86-64, AArch32, AArch64, MIPS64, and RISC-V) it’s possible to do the same thing by moving one of the instructions in the function prolog into the generated trampoline for the instrumentation.

I could understand wanting something more like patchpoints if you want to be able to instrument in the middle of a function (along the lines of TESLA or CSI), but if you’re just tracing function entry and exit then it doesn’t seem like the best solution.

David

Thanks for the questions David. The short version of the answer is that DTrace (last I checked) requires some help from the kernel, while XRay is self-contained in the application.

All of your points above are valid, and DTrace is a really powerful tool for debugging a lot of performance issues. XRay has a few things that differentiate it from systems like DTrace though:

  1. Because we insert the instrumentation sleds only in functions that fit certain criteria (i.e. more selectively) instead of instrumenting every function, we pay the cost of the instrumentation being off only in the functions that are instrumented. The front-end changes that add attributes/annotations to force or inhibit instrumentation give control to the application developer and let us limit the cost along a spectrum: full coverage costs more, selective coverage can be tuned, and explicit annotations provide precise control of the instrumentation.

  2. The cost of the instrumentation at run-time is O(100) cycles for the “null-logging” case (mov + trampoline jump, atomic load and check if not zero). All the cost of instrumentation is within the process’ address space (in-memory log) when on – no additional overheads external to the application.

  3. The runtime implementation for logging described in the white paper allows us to balance the coverage (number of instrumentation events we get) against overheads (the amount of resources used in the logging implementation). Because we log only very specific things (function id, tsc deltas in most cases, type of event) and have heuristics to condense the information we keep (e.g. if entry-exit pairs are under epsilon, we can omit the entry entirely), we don’t need to be quite as complete when logging and can instead move a lot of the logic into reconstruction/analysis of the generated traces.
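
As a rough illustration of the condensing heuristic in point 3, here is a minimal sketch of an in-memory log that drops entry-exit pairs shorter than epsilon. The names (RecordKind, Record, TraceLog) and the record layout are assumptions made for this example, not the format from the white paper. The sketch stores absolute tsc values for clarity rather than deltas, and it interprets "omit the entry" as dropping both records of the pair.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record layout, for illustration only.
enum class RecordKind : std::uint8_t { Entry, Exit };

struct Record {
  std::int32_t func_id;
  std::uint64_t tsc;  // absolute timestamp here; assumed monotonic
  RecordKind kind;
};

class TraceLog {
 public:
  explicit TraceLog(std::uint64_t epsilon) : epsilon_(epsilon) {}

  void on_entry(std::int32_t func_id, std::uint64_t tsc) {
    records_.push_back({func_id, tsc, RecordKind::Entry});
  }

  void on_exit(std::int32_t func_id, std::uint64_t tsc) {
    // If the most recent record is the matching entry and the call took
    // less than epsilon, remove that entry and skip logging the exit.
    if (!records_.empty()) {
      const Record &last = records_.back();
      if (last.kind == RecordKind::Entry && last.func_id == func_id &&
          tsc - last.tsc < epsilon_) {
        records_.pop_back();
        return;
      }
    }
    records_.push_back({func_id, tsc, RecordKind::Exit});
  }

  const std::vector<Record> &records() const { return records_; }

 private:
  std::uint64_t epsilon_;
  std::vector<Record> records_;
};
```

With an epsilon of 100, a nested call that enters at tsc 1010 and exits at 1020 leaves no trace in the log, while its caller, which runs from 1000 to 1500, keeps both of its records.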

There are certainly other approaches to doing selective instrumentation and then externally signalling/trapping (with environment support) when probing. XRay moves this needle towards having the instrumentation, the collection, and even the signalling inside the application. This makes sense if you’re deploying the application on a system that doesn’t have DTrace and you still want to isolate the costs of instrumentation to the application.

I’ll admit that, to provide a more complete answer, I need to read a lot more about how DTrace manages to keep the cost of probes low enough that they can be turned on dynamically without stopping the process, and without having to intercept more events than necessary (i.e. only on certain functions, and only when it’s on).

Does this help?

Cheers