XRay: Demo on x86_64/Linux almost done; some questions.

Hello,

Can I ask you why you chose to patch both function entrances and exits, rather than just patching the entrances and (in the patches) pushing on the stack the address of __xray_FunctionExit , so that the user function returns normally (with RETQ or POP RIP or whatever else instruction) rather than jumping into __xray_FunctionExit?

By patching just the function entrances, you avoid duplication of the function ID (which is currently taking space in the entrance and every exit) and duplication of the rest of the exit patch for every of the potentially many function exits.

This approach also avoids reporting exits for functions, for which entrances have not been reported because the functions were already running at the time patching happened.

This approach should also be faster because smaller code better fits in CPU cache, and patching itself should run faster (because there is less code to modify).

Or does this approach have some issues e.g. with exceptions, longjmp, debugger, etc.?

Below is an example patch code for ARM (sorry, no resource to translate to x86 myself). The compile-time stub (“sled”) would contain a jump as the first instruction, skipping 28 next bytes of NOOPs (on ARM each instruction takes exactly 4 bytes, if not in Thumb etc. mode).

; Look at the disassembly to verify that the sled is inserted before the
; instrumented function pushes caller’s registers to the stack
; (otherwise r4 may not get preserved)
PUSH {r4, lr}
ADR lr, #16 ; relative offset of after_entrance_traced
; r4 must be preserved by the instrumented function, so that
; __xray_FunctionExit gets function ID in r4 too
LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
; call __xray_FunctionEntry (returning to after_entrance_traced)
LDR pc, [pc, #0] ; use the address stored by the patching mechanism
.word <32-bit function ID>
.word <32-bit address of __xray_FunctionEntry>
.word <32-bit address of __xray_FunctionExit>
after_entrance_traced:
; Make the instrumented function think that it must return to __xray_FunctionExit
LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
; __xray_FunctionExit must “POP {r4, lr}” and in the end “BX lr”
; the body of the instrumented function follows

; Before patching (i.e. in sleds) the first instruction is a jump over the
; whole stub to the first instruction in the body of the function. So lr
; register stays original, thus no call to __xray_FunctionExit occurs at the
; the exit of the function, even if it is being patched concurrently.

Hello,

Can I ask you why you chose to patch both function entrances and exits, rather than just patching the entrances and (in the patches) pushing on the stack the address of __xray_FunctionExit , so that the user function returns normally (with RETQ or POP RIP or whatever else instruction) rather than jumping into __xray_FunctionExit?

By patching just the function entrances, you avoid duplication of the function ID (which is currently taking space in the entrance and every exit) and duplication of the rest of the exit patch for every of the potentially many function exits.

This approach also avoids reporting exits for functions, for which entrances have not been reported because the functions were already running at the time patching happened.

This approach should also be faster because smaller code better fits in CPU cache, and patching itself should run faster (because there is less code to modify).

Or does this approach have some issues e.g. with exceptions, longjmp, debugger, etc.?

The only issues I can think of are those of potentially interfering with and invalidating the stack pointer at runtime. Because the patching and determination of what the function id's are happen at runtime and not statically, we can only provide the space for the function id. In x86_64 this works out to only be just a few bytes. We also make sure XRay works even if frame pointers are omitted.

Another issue is that of tail call and sibling call optimisations. Because exiting these functions actually turn out to be jumps, we cannot be sure that the jumped-to function will clean up the stack appropriately.

As far as avoiding writing exit records without entry records, we deal with those externally (during analysis of the trace). It's important to know that when instrumentation is turned on (i.e. the log handler is not nullptr) that there was a function already running and that it exited at a given point in time. Especially when unwinding a deep function call stack, we can keep track of this as it's important information for analysis.

Consider the following case:

A() -> B() -> C() -> D() -> E()

When instrumentation is enabled after E() has started, we can see records of the following kind:

[timestamp, cpu] Exit E()
[timestamp, cpu] Exit D()
[timestamp, cpu] Exit B()
[timestamp, cpu] Exit A()

Note that the difference between "Exit E()" and "Exit D()" may not be 0 -- and that there may have very well been work happening between the exit of E() and exit of D(), and similarly up the stack.

Does this make sense?

Below is an example patch code for ARM (sorry, no resource to translate to x86 myself). The compile-time stub ("sled") would contain a jump as the first instruction, skipping 28 next bytes of NOOPs (on ARM each instruction takes exactly 4 bytes, if not in Thumb etc. mode).

; Look at the disassembly to verify that the sled is inserted before the
; instrumented function pushes caller's registers to the stack
; (otherwise r4 may not get preserved)
PUSH {r4, lr}
ADR lr, #16 ; relative offset of after_entrance_traced
; r4 must be preserved by the instrumented function, so that
; __xray_FunctionExit gets function ID in r4 too
LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
; call __xray_FunctionEntry (returning to after_entrance_traced)
LDR pc, [pc, #0] ; use the address stored by the patching mechanism
.word <32-bit function ID>
.word <32-bit address of __xray_FunctionEntry>
.word <32-bit address of __xray_FunctionExit>
after_entrance_traced:
; Make the instrumented function think that it must return to __xray_FunctionExit
LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
; the body of the instrumented function follows

; Before patching (i.e. in sleds) the first instruction is a jump over the
; whole stub to the first instruction in the body of the function. So lr
; register stays original, thus no call to __xray_FunctionExit occurs at the
; the exit of the function, even if it is being patched concurrently.

Cool, thanks -- we have an interim logging implementation for x86 which does the naïve logging to memory then flushes to disk regularly (I suspect you've already seen https://reviews.llvm.org/D21982). In that patch we have the very early beginnings of a test suite, so I think if you'd like to contribute the ARM implementation, that we can review that patch and land it to allow you to add tests and make sure that this also works on ARM.

I have zero experience with actually doing anything with ARM assembly and I'd appreciate all the help I can get to make XRay work on ARM too.

Cheers!

>
> Hello,
>
> Can I ask you why you chose to patch both function entrances and exits,
rather than just patching the entrances and (in the patches) pushing on the
stack the address of __xray_FunctionExit , so that the user function
returns normally (with RETQ or POP RIP or whatever else instruction) rather
than jumping into __xray_FunctionExit?
>
> By patching just the function entrances, you avoid duplication of the
function ID (which is currently taking space in the entrance and every
exit) and duplication of the rest of the exit patch for every of the
potentially many function exits.
>
> This approach also avoids reporting exits for functions, for which
entrances have not been reported because the functions were already running
at the time patching happened.
>
> This approach should also be faster because smaller code better fits in
CPU cache, and patching itself should run faster (because there is less
code to modify).
>
> Or does this approach have some issues e.g. with exceptions, longjmp,
debugger, etc.?
>

The only issues I can think of are those of potentially interfering with
and invalidating the stack pointer at runtime. Because the patching and
determination of what the function id's are happen at runtime and not
statically, we can only provide the space for the function id. In x86_64
this works out to only be just a few bytes. We also make sure XRay works
even if frame pointers are omitted.

Another issue is that of tail call and sibling call optimisations. Because
exiting these functions actually turn out to be jumps, we cannot be sure
that the jumped-to function will clean up the stack appropriately.

As far as avoiding writing exit records without entry records, we deal
with those externally (during analysis of the trace). It's important to
know that when instrumentation is turned on (i.e. the log handler is not
nullptr) that there was a function already running and that it exited at a
given point in time. Especially when unwinding a deep function call stack,
we can keep track of this as it's important information for analysis.

Consider the following case:

A() -> B() -> C() -> D() -> E()

When instrumentation is enabled after E() has started, we can see records
of the following kind:

[timestamp, cpu] Exit E()
[timestamp, cpu] Exit D()
[timestamp, cpu] Exit B()
[timestamp, cpu] Exit A()

Note that the difference between "Exit E()" and "Exit D()" may not be 0 --
and that there may have very well been work happening between the exit of
E() and exit of D(), and similarly up the stack.

Does this make sense?

Yes, this makes sense, thanks for the analysis. I'm going to investigate
later how to keep the stack consistent for unwinding (so to support C++
exceptions), e.g. by pretending that the __xray_FunctionExit call is the
destructor of the first object (local variable) on the stack.

> Below is an example patch code for ARM (sorry, no resource to translate
to x86 myself). The compile-time stub ("sled") would contain a jump as the
first instruction, skipping 28 next bytes of NOOPs (on ARM each instruction
takes exactly 4 bytes, if not in Thumb etc. mode).
>
> ; Look at the disassembly to verify that the sled is inserted before the
> ; instrumented function pushes caller's registers to the stack
> ; (otherwise r4 may not get preserved)
> PUSH {r4, lr}
> ADR lr, #16 ; relative offset of after_entrance_traced
> ; r4 must be preserved by the instrumented function, so that
> ; __xray_FunctionExit gets function ID in r4 too
> LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
> ; call __xray_FunctionEntry (returning to after_entrance_traced)
> LDR pc, [pc, #0] ; use the address stored by the patching mechanism
> .word <32-bit function ID>
> .word <32-bit address of __xray_FunctionEntry>
> .word <32-bit address of __xray_FunctionExit>
> after_entrance_traced:
> ; Make the instrumented function think that it must return to
__xray_FunctionExit
> LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
> ; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
> ; the body of the instrumented function follows
>
> ; Before patching (i.e. in sleds) the first instruction is a jump over
the
> ; whole stub to the first instruction in the body of the function. So
lr
> ; register stays original, thus no call to __xray_FunctionExit occurs
at the
> ; the exit of the function, even if it is being patched concurrently.

Cool, thanks -- we have an interim logging implementation for x86 which
does the naïve logging to memory then flushes to disk regularly (I suspect
you've already seen https://reviews.llvm.org/D21982).

No, I wasn't aware of that patch, thanks for pointing out!

In that patch we have the very early beginnings of a test suite, so I
think if you'd like to contribute the ARM implementation, that we can
review that patch and land it to allow you to add tests and make sure that
this also works on ARM.

I have zero experience with actually doing anything with ARM assembly and
I'd appreciate all the help I can get to make XRay work on ARM too.

Yes, I am trying to port XRay on LLVM to ARM, but I'm just starting with
LLVM.

It may well be slower. Larger CPUs tend to track the call stack in
hardware and returning to an address pushed manually is an inevitable
branch mispredict in those cases.

Cheers.

Tim.

Thanks for pointing this out, Tim. Then maybe this approach is not the best choice for x86, though ideally measuring is needed, it is just that on ARM the current x86 approach is not applicable because ARM doesn’t have a single return instruction (such as RETQ on x86_64), furthermore, the return instructions on ARM can be conditional.

I have another question: what happens if the instrumented function (or its callees) throws an exception and doesn’t catch? I understood that currently XRay will not report an exit from this function in such case because the function doesn’t return with RETQ, but rather the stack unwinder jumps through functions calling the destructors of local variable objects.

If so, why not to instrument the functions by placing a tracing object as the first local variable, with its constructor calling __xray_FunctionEntry and destructor calling __xray_FunctionExit ? Perhaps this approach requires changes in the front-end (C++ compiler, before emitting IR).

Cheers.

Yes, this makes sense, thanks for the analysis. I'm going to investigate later how to keep the stack consistent for unwinding (so to support C++ exceptions), e.g. by pretending that the __xray_FunctionExit call is the destructor of the first object (local variable) on the stack.

I think there's a better way to do this, which involves actually instrumenting the catch handlers too -- so that we can determine which functions have been exited, and which ones have resumed execution. This requires a bit more work and few more bytes in a few more places.

Although I would be interested in finding out what you come up with at least that doesn't involve instrumenting the throw/catch points.

No, I wasn't aware of that patch, thanks for pointing out!

No worries, I'd appreciate your review of that too. :slight_smile:

In that patch we have the very early beginnings of a test suite, so I think if you'd like to contribute the ARM implementation, that we can review that patch and land it to allow you to add tests and make sure that this also works on ARM.

I have zero experience with actually doing anything with ARM assembly and I'd appreciate all the help I can get to make XRay work on ARM too.
Yes, I am trying to port XRay on LLVM to ARM, but I'm just starting with LLVM.

Awesome! Let me know how else I can help.

Cheers

Thanks for pointing this out, Tim. Then maybe this approach is not the best choice for x86, though ideally measuring is needed, it is just that on ARM the current x86 approach is not applicable because ARM doesn't have a single return instruction (such as RETQ on x86_64), furthermore, the return instructions on ARM can be conditional.

I think this is a flaw in the current implementation. I long suspected that other platforms didn't have a single return instruction like x86, and I think it's worth changing this.

In an early version of the implementation that made it into LLVM, I made the determination whether an instruction was a return instruction, and make that determination per-platform.

We should make that happen so we can easier support ARM.

I have another question: what happens if the instrumented function (or its callees) throws an exception and doesn't catch? I understood that currently XRay will not report an exit from this function in such case because the function doesn't return with RETQ, but rather the stack unwinder jumps through functions calling the destructors of local variable objects.

If so, why not to instrument the functions by placing a tracing object as the first local variable, with its constructor calling __xray_FunctionEntry and destructor calling __xray_FunctionExit ? Perhaps this approach requires changes in the front-end (C++ compiler, before emitting IR).

That's right -- the approach I've been thinking about (but haven't gotten to implementing yet, because of other priorities) has been to have special instrumentation at the throw and catch points. Similarly understanding things like tail call exits and sibling call optimisations and instrumenting those exit points is another thing that is yet to be implemented.

Unfortunately changing the front-end to add an RAII or guard object will cause a whole lot of problems too. For instance, optimisations on the IR can move instructions before/after the guard object initialisation and then we end up with less accurate instrumentation. This also introduces potentially unwanted semantics and changes the stack layout and affecting the performance of an application with XRay sleds but instrumentation not enabled.

Hi Dean,

I have a question about the following piece of code in compiler-rt/trunk/lib/xray/xray_trampoline_x86.S :

movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax
testq %rax, %rax
je .Ltmp0

// assume that %r10d has the function id.
movl %r10d, %edi
xor %esi,%esi
callq *%rax

What happens if someone unsets the handler function (i.e. calls __xray_remove_handler() ) or changes the handler (i.e. calls __xray_set_handler() with a different pointer to function) between “movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax” and “callq *%rax” ? I understood that the old handler will still be called, but isn’t it a problem? What if the code removing the handler starts destructing some objects after that, so that a handler call would result in a crash or other bad things?

The same concern for analogous code in __xray_FunctionExit .

Cheers,
Serge

That would be unfortunate. :slight_smile:

Yes, you're right, it will be called despite having the handler be un-set from the global atomic. Though I'd suggest that log handler implementations really shouldn't be crashing, and should assume that it will be called at any given time (from multiple threads, so it should also be thread-safe). Consider also that we're taking a function pointer by design -- assuming that it will be a pointer to a function that will only deal with globals and the explicitly passed in arguments.

An implementer of the log handler should really be responsible for "internal signalling" -- i.e. implementing it so that any required synchronisation happens internally.

Does this help?

PS. I think more definitive documentation for this is due, let me think about putting something some documentation on how to use XRay as it is currently implemented once we have the naive in-memory logging implementation land (https://reviews.llvm.org/D21982).

Cheers

-- Dean

Yes, double-checking on the handler side whether its resources are still operational, e.g. by calling weak_ptr::lock() may be a better idea than elimination of spurious handler calls in XRay at a cost of heavy synchronization objects. Ok, let’s live with spurious handler calls.

Cheers,
Serge

Hi Dean,

I have a question for 32-bit platforms. I see in the code that you used the following in compiler-rt/trunk/lib/xray/xray_interface_internal.h :

struct XRaySledEntry {
uint64_t Address;
uint64_t Function;
unsigned char Kind;
unsigned char AlwaysInstrument;
unsigned char Padding[14]; // Need 32 bytes
};

And the peer code in llvm/trunk/lib/Target/X86/X86MCInstLower.cpp :

void X86AsmPrinter::EmitXRayTable() {
if (Sleds.empty())
return;
if (Subtarget->isTargetELF()) {
auto *Section = OutContext.getELFSection(
“xray_instr_map”, ELF::SHT_PROGBITS,
ELF::SHF_ALLOC | ELF::SHF_GROUP | ELF::SHF_MERGE, 0,
CurrentFnSym->getName());
auto PrevSection = OutStreamer->getCurrentSectionOnly();
OutStreamer->SwitchSection(Section);
for (const auto &Sled : Sleds) {
OutStreamer->EmitSymbolValue(Sled.Sled, 8);
OutStreamer->EmitSymbolValue(CurrentFnSym, 8);
auto Kind = static_cast<uint8_t>(Sled.Kind);
OutStreamer->EmitBytes(
StringRef(reinterpret_cast<const char *>(&Kind), 1));
OutStreamer->EmitBytes(
StringRef(reinterpret_cast<const char *>(&Sled.AlwaysInstrument), 1));
OutStreamer->EmitZeros(14);
}
OutStreamer->SwitchSection(PrevSection);
}
Sleds.clear();
}

So useful part of your entry is 18 bytes, and you add 14 bytes of padding to achieve 32-byte alignment. But for 32-bit CPUs I understood that XRaySledEntry::Address and XRaySledEntry::Function can be 32-bit too, right? So the entry can fit 16 bytes, with 10 useful bytes and 6 bytes of padding. Can I use 16-byte entries or is there some external (OS? ELF? Linker?) requirement that one entry must be 32-bytes, or aligned at 32 bytes, etc.?

Thanks,
Serge

Good question Serge -- technically there isn't any specific external requirement here, but that supporting 32-bit x86 isn't a priority for me right now. I suspect it's possible to support 32-bit x86 with a similar approach (modifying both the LLVM back-end to emit the right assembly for 32-bit x86 and maybe for 32-bit non-x86, as well as compiler-rt to work on 32-bit x86) but that I haven't had the time to explore this yet.

I'm positive that it's doable though and that we know the right places where the changes have to happen. There's some work being done on the tooling side of things and I suspect once we have a standardised log format, things like endianness and sizes of certain values start becoming an issue. For example, if I build a log analysis tool to work on 64-bit systems, whether it should be able to handle log files generated in 32-bit systems (and be able to read differently-sized instrumentation map sections).

For this reason, I think it's better to stay consistent and forward-compatible (i.e. not have special cases for 32-bit platforms). I do think it's important to support 32-bit systems too, and I'd be more than happy to review patches that would make it possible (until say I get the time to support 32-bit platforms later on).

Does that make sense?

Cheers

-- Dean

I think that 32-bit systems (especially ARM) may be short on memory so doubling the size of the table containing (potentially) all the functions may give a tangible overhead. I would even align the entries to 4 bytes (so 12 bytes per entry) on 32-bit platforms and to 8 bytes (so 24-bytes per entry) on 64-bit platforms, to improve CPU cache hits. What do you think?

Cheers,
Serge

It should work, but I'm a little wary about painting ourselves into a corner -- for example, I'm already designing some extensions to this table to represent other kinds of information (that might fit into what currently is padding, or use some more bits of the bytes in the entries). I might need to either have some sort of versioning introduced into this table so that tools reading the same table (not necessarily the runtime) can determine what kinds of information will be available in the entries. Although you're right, maybe 14 bytes of padding is a little excessive but I'm being very conservative here. :smiley:

Basically the trade-off is between binary/resident size and tooling support.

For 32-bit systems I think its possible to have smaller entries at the cost of making the tooling and runtime a bit more complex -- i.e. there's going to be a special implementation for x86 32-bit and 64-bit, arm 32-bit and 64-bit, etc. Then we think about the tools external to the runtime that will access the same table. We can probably write tools that extract the table from binaries in COFF, ELF, and MachO then turn those into a canonicalised instrumentation map. We don't even deal with endianness (reading an instrumentation map for a 64-bit binary from a 32-bit system -- what order to the bytes come in and how should the 32-bit system interpret those values). These issues start expanding the tooling support matrix.

Maybe that's inevitable, depending on which platforms the members of the LLVM community would like to see XRay be available. :smiley:

As far as CPU cache hit/misses are concerned, I personally don't think it's that crucial to get the table packed so we utilise the cache more -- the patching code runs through this table sequentially, and the cost is actually in the sys calls making code pages writeable (and marking the page dirty and causing all sorts of more important issues). I think cache hits/misses are the least of the problems here. :wink:

I am open to suggestions here too, so I'd be happy to shave a few more bytes off if that means the impact of having XRay tables in the binary is minimised.

I suppose I should detail a bit more what other things will be coming up, which should help with the overall design direction here. I'll update within the week about some other things we're looking to bring upstream with more details as soon as I'm done fleshing those out. :slight_smile:

Stay tuned!

Cheers

-- Dean

Hi Dean,

Do you have any estimates on when https://reviews.llvm.org/D21982 will reach mainline? (As I understood it’s not yet there, looking at http://llvm.org/svn/llvm-project/compiler-rt/trunk ). I would like to test ARM port of XRay, so ready logging would be handful.

Thanks,
Serge

This is currently dependent on reviewers -- right now only David Majnemer and Mehdi Amini have been reviewing in the past few weeks (I understand both of them are currently swamped). If you could have a look at that and do a review as well, that'd be great to move it forward.

Just out of curiosity, are you sending the ARM port upstream too? If so I'd be really interested in seeing how that's gone.

Cheers

-- Dean

Yes, I will review your patches.
Yes, I plan to deliver the ARM port to mainline eventually. The basic port is ready, but I need to test it first and I’m new to LLVM, so I don’t know how long non-coding procedures will take. Sure, I’ll add you as a reviewer.

Cheers,
Serge

Awesome stuff, thanks Serge!

Cheers

-- Dean