RFC: Processor Trace Support in LLDB

Hi all,

Along with Greg Clayton, I'm proposing Processor Trace support for
LLDB. I'm attaching a link to the document that contains this
proposal, in case that's easier to read:
https://docs.google.com/document/d/1cOVTGp1sL_HBXjP9eB7qjVtDNr5xnuZvUUtv43G5eVI/edit#heading=h.t5mblb9ugv8f
Please make any comments in this mail list.

If you want a quick overview of what Processor Trace can do, you can
read https://easyperf.net/blog/2019/08/23/Intel-Processor-Trace.

Any comments are appreciated, especially those regarding the commands
the user will interact with.

Thanks,

Walter Erquinigo.

# RFC: Processor Trace Support in LLDB

# What is processor tracing?

Processor tracing works by capturing information about the execution
of a process so that the control flow of the program can be
reconstructed later. Implementations of this are Intel Processor Trace
for x86 and x86_64
([https://software.intel.com/content/www/us/en/develop/blogs/processor-tracing.html](https://software.intel.com/content/www/us/en/develop/blogs/processor-tracing.html))
and ARM CoreSight for some ARM devices
([https://developer.arm.com/ip-products/system-ip/coresight-debug-and-trace](https://developer.arm.com/ip-products/system-ip/coresight-debug-and-trace)).

As a clarifying example, with these technologies it’s possible to
trace all the threads of a process, and after the process has
finished, reconstruct every single instruction address each thread has
executed. This could include some additional information like
timestamps, async CPU events, kernel instructions, bus clock ratio
changes, etc. On the other hand, memory and registers are not traced
as a way to limit the size of the trace.

# Intel Processor Trace as the first implementation

We’ll focus on Intel Processor Trace (Intel PT), but in a generic way
so that in the future similar technologies can be onboarded in LLDB.

Intel PT has the following features:

* Control flow tracing in a highly encoded format

* 3% to 5% slowdown when capturing

* Neither memory nor registers are captured

* Kernel tracing support

* Timestamps of branches are produced, which can be used for profiling

* Adjustable size of trace buffer

* Supported on most Intel CPUs since 2015

* X86 and x86_64 only

* Official support only on Linux

* Basic support on Windows

* Decoding/analysis can be done on any operating system

A very nice introduction to Intel PT can be found in
[https://software.intel.com/content/www/us/en/develop/blogs/processor-tracing.html](https://software.intel.com/content/www/us/en/develop/blogs/processor-tracing.html)
and [https://easyperf.net/blog/2019/08/23/Intel-Processor-Trace](https://easyperf.net/blog/2019/08/23/Intel-Processor-Trace).
Reading them is highly recommended to fully grasp the impact of this project.

More technical details are in
[https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt](https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/perf-intel-pt.txt).

Even more technical details are in the processor manual
[https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3c-part-3-manual.pdf](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3c-part-3-manual.pdf)

# Basic Definitions

* Trace file: A trace file contains the target addresses of each
branch or jump within the program execution, in a highly encoded
format.

* Capturing: The act of tracing a process and producing a trace file.

* Decoding: Decoding outputs a sequential list of instructions given
a trace file and the images of a process. Decoding is generally an
offline step as it’s expensive.

* Trace buffer: In order to limit the size of the trace, an
in-memory circular buffer can be used, keeping only the most recent
branching information. The trace file is a snapshot of this buffer.

* Gap: Sporadically some branching information can be lost or be
impossible to decode, which creates a gap in the reconstructed control
flow.
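The trace-buffer definition above can be sketched with a toy model. This is a hedged illustration only (the class and method names are invented, and real Intel PT buffers hold encoded packets, not strings): once the buffer is full, the oldest branch packets are silently overwritten by new ones, and a "trace file" is just a snapshot of the current contents.

```python
from collections import deque

class TraceRingBuffer:
    """Simplified model of an in-memory trace buffer: once full,
    the oldest branch packets are overwritten by new ones."""

    def __init__(self, capacity):
        self._packets = deque(maxlen=capacity)

    def record(self, packet):
        # Appending to a full bounded deque drops the oldest entry,
        # mimicking the circular overwrite done by the hardware.
        self._packets.append(packet)

    def snapshot(self):
        # A "trace file" is a snapshot of the current buffer contents.
        return list(self._packets)

buf = TraceRingBuffer(capacity=3)
for pkt in ["b1", "b2", "b3", "b4", "b5"]:
    buf.record(pkt)
print(buf.snapshot())  # -> ['b3', 'b4', 'b5']: only the most recent survive
```

This is why only the last milliseconds of execution are recoverable: older history is overwritten, which shows up as the start of the reconstructed control flow rather than as a gap.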

# New LLDB features

* Loading traces: We want to be able to load traces, potentially
captured on other computers, and have LLDB symbolicate them. A flow
like the following should be possible:

    ```
    $ trace load /path/to/trace
    $ trace dump --instructions
    pid: '1234', tid: '1981309'
      a.out`main
      [57] 0x400549 <+13>: movl %eax, -0x4(%rbp)
      a.out`bar()
      [56] 0x40053b <+46>: retq
      [55] 0x40053a <+45>: leave
      [54] 0x400537 <+42>: movl -0x4(%rbp), %eax
      [53] 0x400535 <+40>: jle 0x400525 ; <+24> at main.cpp:7
      [52] 0x400531 <+36>: cmpl $0x3, -0x8(%rbp)
      [51] 0x40052d <+32>: addl $0x1, -0x8(%rbp)
      [50] 0x40052a <+29>: addl %eax, -0x4(%rbp)
      a.out`foo()
      [49] 0x400567 <+15>: retq
      [48] 0x400566 <+14>: popq %rbp
      [47] 0x400563 <+11>: movl -0x4(%rbp), %eax
      [46] 0x40055c <+4>: movl $0x2a, -0x4(%rbp)
      ...
      [1] 0x400559 <+1>: movq %rsp, %rbp
      [0] 0x400558 <+0>: pushq %rbp
    // Format:
    // [instruction index] <instruction disassembly>
    ```

    Notice the resemblance to loading a core file, but in this case we
    can get the control flow, printed in reverse order in this example.

* Decoding: LLDB can use libipt
([https://github.com/intel/libipt](https://github.com/intel/libipt)),
which is the low level Intel PT decoding library, to convert trace
files into instructions.

* Showing instructions: LLDB can output the list of instructions of
the control flow, as shown above.

* Showing function calls: Similarly, LLDB can print a hierarchical
view of the function calls. A flow like this should be possible:

    ```
    $ trace load /path/to/trace
    $ trace dump --function-calls
    pid: '1234', tid: '1981309'
      [50] a.out`bar() 0x40052a
      [45] a.out`zaz() 0x400558
      [40] a.out`baz() 0x400559
      [30] a.out`foo() 0x400567
      [0] a.out`main 0x400000
    ```

    This functionality allows LLDB to reconstruct the call stack at any
    point and potentially do reverse debugging.

* Capturing: LLDB can also do the Intel PT capturing of a live
process, so that at any stop the user can do reverse stepping or
simply inspect the trace. A possible flow is:

    ```
    $ <stopped at main>
    $ b main.cpp:50
    $ trace start intel-pt // this initiates the tracing
    $ continue
    $ <stopped at main.cpp:50>
    $ trace dump --instructions
    pid: '1234', tid: '1981309'
      a.out`main
      [57] 0x400549 <+13>: movl %eax, -0x4(%rbp)
      a.out`bar()
      [56] 0x40053b <+46>: retq
      [55] 0x40053a <+45>: leave
    ```

* Displaying time information: If the trace contains timing
information, we could also display it along with each instruction,
e.g.

    ```
    a.out`bar()
    [56: 1600284226]: 0x40053b <+46>: retq
    ...
    [4: 1600284200]: 0x40053a <+45>: leave
    // Format:
    // [instruction index: unix timestamp] <instruction disassembly>
    ```

    Furthermore, we could display the time spent in each function.

# Future LLDB features

* Reverse Stepping: With the hierarchical reconstruction of the
function calls, along with the individual instructions, LLDB can offer
reverse stepping. Operations like reverse-next, reverse-step-out,
reverse-continue could work by traversing the trace. We plan to work
on this once the features presented above are in place.

* Trace-based profiling

* SB API of the mentioned features

# Why is this useful?

* Bug root-causing:

    * For example, a crash in a production Release build ends up
being analyzed with logs, a coredump, and a stack trace. Logs are not
comprehensive, and a stack trace only contains the final state of the
program. Providing the user with the control flow of the last
milliseconds gives a tremendous amount of information that is
game-changing in root-causing issues. It could be said that the user
goes from a single stack trace to a list of stack traces.

    * Reverse stepping enables more efficient debugging, as it
reduces the number of iterations needed to root-cause a bug. More
often than not, reproducing a bug takes a considerable amount of time,
and the user needs to reproduce it several times until the correct
breakpoints are hit. Giving the user the information of what has been
executed so far can help them figure out where to place a breakpoint,
or even directly see what went wrong.

* Low cost: unlike other similar technologies, Intel PT has an
almost negligible performance cost regardless of whether the build is
optimized or not, making it appealing to a wide range of scenarios.

* This infrastructure can be used for enabling other tools like
non-sample-based profilers with instruction-level accuracy, security
analyzers that check if certain memory regions are executed, and trace
comparators, which could find bugs by comparing similar traces.

# Goals of this document:

* Gather feedback on the basic Trace implementation, which would
include the following basic operations: loading, decoding, and
dumping.

* Create awareness about this work.

* Get a green light on the current set of patches implementing this
feature starting with https://reviews.llvm.org/D85705.

# Non-Goals:

* Discuss how reverse-stepping will be implemented. This can be left
for another discussion. Once the Trace architecture is in place and
robust, reverse-stepping can then be discussed, as it’s a more
controversial change than this one.

* Explain Intel PT thoroughly.

# Existing Tool Support

* GDB has a basic implementation of the features above
([https://sourceware.org/gdb/onlinedocs/gdb/Process-Record-and-Replay.html](https://sourceware.org/gdb/onlinedocs/gdb/Process-Record-and-Replay.html))
and some ideas are taken from there.

* Perf is a standalone tool that can do capturing and decoding.

* The Linux kernel has full support for doing capturing at thread,
logical CPU, or cgroup level.

* Intel developed a basic version of Intel PT support in LLDB as an
external plugin.
[https://reviews.llvm.org/D33674](https://reviews.llvm.org/D33674),
[https://reviews.llvm.org/rG307db0f8974d1b28d7b237cb0d50895efc7f6e6b](https://reviews.llvm.org/rG307db0f8974d1b28d7b237cb0d50895efc7f6e6b).

# New Trace Commands

Based on this patch
[https://reviews.llvm.org/D85705](https://reviews.llvm.org/D85705),
there would be a common Trace class along with plug-in
implementations.

## Trace loading

### $ trace load /path/to/trace/settings/file.json

As decoding a trace requires the images of the object files, the trace
files and some CPU information, it’s convenient to have a JSON file
that describes an entire trace session. The following JSON schema
could be used.


{
  "trace": {
    ... // plug-in specific information
  },
  "processes": [   // process information common to all trace plug-ins
    {
      "pid": integer,
      "triple": string, // llvm-triple
      "threads": [
        {
          "tid": integer,
          "traceFile": string
        }
      ],
      "modules": [
        {
          "systemPath": string, // original path of the module at runtime
          "file"?: string, // copy of the file if not available at "systemPath"
          "loadAddress": string, // string address in hex or decimal form
          "uuid"?: string
        }
      ]
    }
  ]
}

// Notes:
// All paths are either absolute or relative to the settings file.
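To make the path-resolution note concrete, here is a hedged Python sketch of how a loader might consume such a settings file. Only the field names from the schema are taken from the proposal; the sample content, the function name, and the resolution strategy are illustrative assumptions, not the actual LLDB implementation.

```python
import json
import os

# Invented sample content following the schema above.
SETTINGS = '''
{
  "trace": {"type": "intel-pt"},
  "processes": [
    {
      "pid": 1234,
      "triple": "x86_64-linux-gnu",
      "threads": [{"tid": 1981309, "traceFile": "threads/1981309.trace"}],
      "modules": [{"systemPath": "/usr/bin/a.out", "loadAddress": "0x400000"}]
    }
  ]
}
'''

def load_settings(text, settings_dir):
    data = json.loads(text)
    for process in data["processes"]:
        for thread in process["threads"]:
            # Per the notes, relative paths are resolved against the
            # directory containing the settings file.
            if not os.path.isabs(thread["traceFile"]):
                thread["traceFile"] = os.path.join(settings_dir,
                                                   thread["traceFile"])
        for module in process["modules"]:
            # loadAddress may be hex or decimal; base=0 accepts both.
            module["loadAddress"] = int(module["loadAddress"], 0)
    return data

data = load_settings(SETTINGS, "/traces/session1")
print(data["processes"][0]["threads"][0]["traceFile"])
print(hex(data["processes"][0]["modules"][0]["loadAddress"]))
```

Resolving paths relative to the settings file (rather than the working directory) is what makes a trace bundle relocatable between machines.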

**Corefiles:**

We plan to extend this schema to support corefiles, but we leave that
out of this discussion, as it can easily be seen as an extension of
this basic schema.

**Implementation details:**

To make our first implementation easier, we'll ask for an individual
trace file per thread. This is the simplest collection mode for Intel
PT.

The entire JSON file will be translated into a Trace object, which
contains the trace information of each thread and process in it.

Each process in the JSON file will be represented as a new Target.
Similarly, threads and modules for each target will be created
following the JSON file. This is very similar to what loading a
minidump or coredump does.

Each Target will be associated with a Trace, and multiple targets can
share the same Trace. The contract is that the trace is assumed to end
at the current PC of each thread of the target.

### $ trace schema <plug-in>

This command prints the JSON schema of the trace settings file for the
provided plug-in. It would output something similar to this:


{
  "trace": {
    "type": "intel-pt",
    "pt_cpu": {
      "vendor": "intel" | "unknown",
      "family": integer,
      "model": integer,
      "stepping": integer
    }
  },
  "processes": [
    {
      "pid": integer,
      "triple": string, // llvm-triple
      "threads": [
        {
          "tid": integer,
          "traceFile": string
        }
      ],
      "modules": [
        {
          "systemPath": string, // original path of the module at runtime
          "file"?: string, // copy of the file if not available at "systemPath"
          "loadAddress": string, // string address in hex or decimal form
          "uuid"?: string
        }
      ]
    }
  ]
}

// Notes:
// All paths are either absolute or relative to the settings file.

### $ trace dump [--verbose] [-t tid1] [-t tid2] ...

Print the trace information corresponding to the provided thread ids
of the currently selected target, which would mainly include the same
information as the trace settings file. If no tid is provided, the
currently selected thread is used. This would be useful for debugging.
The information would look like

  Modules:
    <module info like systemPath, file, load address, uuid, size>

  Threads:
    <thread info like location of trace file, number of instructions
    (if already decoded), number of function calls (if already decoded)>

If --verbose is passed, the original settings.json file is printed as well.

## Decoder-based commands

The following commands require decoding the trace and are of the form
“trace dump <action> [-t <tid>]”. If tids are not specified, the
current thread of the current target will be used.

### $ trace dump --instructions [-t <tid>] [-c <count> = 10] [-o <offset> = 0]

This command would print the last <count> instructions, starting at
the given offset from the last instruction in the trace. The output
would be similar to that of the “disassemble” command and would
include timing information if available.


    $ trace dump --instructions -c 5
    pid: '1234', tid: '1981309'
      a.out`main
      [57] 0x400549 <+13>: movl   %eax, -0x4(%rbp)
      a.out`bar()
      [56] 0x40053b <+46>: retq
      [55] 0x40053a <+45>: leave
      [54] error -13. 'no memory mapped at this address'
      a.out`foo()
      [53] 0x400567 <+15>: retq

Repeating the command would continue printing where it left off in
the last run.

**Implementation details:**

Each instruction output by the decoder is either an actual instruction
or an error. An error can be caused by a collection error (e.g. an
internal CPU buffer overflow) or a decoding error (e.g. the image of
an object file is missing while decoding). These errors represent gaps
in the trace and the user should know about them, so we print them
accordingly in this dump.

Each instruction (including errors) has an index in the decoded trace,
and serves as a checkpoint.

### $ trace dump --function-calls [-t <tid>] [-c <count> = 10] [-o <offset> = 0] [--flat]

This command would print the hierarchical list of function calls.
Similar to the “--instructions” command, it would show the last
<count> function calls at the given offset from the last instruction.
Timing information would be included if available.


    $ trace dump --function-calls
    pid: '1234', tid: '1981309'
      [50]     a.out`bar()         0x40052a
      [45]       a.out`zaz()       0x400558
      [40]     a.out`baz()         0x400559
      [30]   a.out`foo()           0x400567
      [0]  a.out`main              0x400000

Repeating the command would continue printing where it left off in
the last run.

If --flat is passed, then instead of a hierarchical view, a flat
list would be produced.
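As a sketch of the difference between the hierarchical and flat views, here is a small Python model. The call records and the exact formatting are invented sample data loosely matching the example above (in particular, the depth values are an assumption about how nesting would be tracked); this is not the proposed implementation.

```python
# Each record: (instruction index, nesting depth, symbol, address).
# Depths mirror the indentation in the hierarchical example above.
CALLS = [
    (50, 2, "a.out`bar()", 0x40052A),
    (45, 3, "a.out`zaz()", 0x400558),
    (40, 2, "a.out`baz()", 0x400559),
    (30, 1, "a.out`foo()", 0x400567),
    (0,  0, "a.out`main",  0x400000),
]

def dump_function_calls(calls, flat=False):
    """Render call records either indented by depth or as a flat list."""
    lines = []
    for index, depth, name, addr in calls:
        indent = "" if flat else "  " * depth
        lines.append(f"[{index}] {indent}{name} {addr:#x}")
    return lines

for line in dump_function_calls(CALLS):
    print(line)
```

The same records drive both modes; --flat simply drops the depth-based indentation, which is handy when piping the output to text-processing tools.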

## Capturing command

### $ trace start <plugin_name> [-t <tid>] [--all] [-b <buffer_size_in_KB>]

This command will start tracing the given thread of the currently
selected target, or all the threads of that target if “--all” is
passed. If “--all” is passed, any thread created after this command
will also be traced automatically.

In addition, the optional -b parameter can define the size of each
trace buffer to be created. I haven't yet decided on a default, but 1MB
might be acceptable: according to Intel, it captures around 1 million
instructions on average, which is more than enough for a useful
analysis.

For an initial implementation, the plugin_name parameter will be
required (e.g. intel-pt). Later a more automated mechanism for finding
the right plugin can be implemented.

**Implementation notes:**

There’s already a basic implementation in LLDB as an external plugin,
at [https://reviews.llvm.org/source/llvm-github/browse/master/lldb/tools/intel-features/intel-pt/](https://reviews.llvm.org/source/llvm-github/browse/master/lldb/tools/intel-features/intel-pt/),
introduced in [https://reviews.llvm.org/rG307db0f8974d1b28d7b237cb0d50895efc7f6e6b](https://reviews.llvm.org/rG307db0f8974d1b28d7b237cb0d50895efc7f6e6b).
It hasn’t received much attention and has been mostly unmaintained
since it was created. It’s already capable of tracing a given thread
and collecting the trace buffer. We plan to reuse that logic, which is
already working.

A Trace object will be created and will be associated with the current Target.

Any interaction with the trace, like dumping instructions, will
trigger a fetch of the most recent trace buffer, unless it hasn’t
changed.
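The lazy-fetch behavior just described can be sketched as follows. Everything here is an illustrative assumption (the class, the use of a stop ID as the change marker, and both callables are invented); it only shows the caching idea, not the real LLDB code.

```python
class LiveThreadTrace:
    """Fetch the remote trace buffer only when it may have changed,
    tracked here by a hypothetical per-process stop ID."""

    def __init__(self, get_stop_id, fetch_buffer):
        self._get_stop_id = get_stop_id
        self._fetch_buffer = fetch_buffer   # expensive remote read
        self._stop_id = None
        self._buffer = b""

    def buffer(self):
        stop_id = self._get_stop_id()
        if stop_id != self._stop_id:        # only refetch after a new stop
            self._stop_id = stop_id
            self._buffer = self._fetch_buffer()
        return self._buffer

fetches = []
trace = LiveThreadTrace(lambda: 1,
                        lambda: fetches.append(1) or b"pt-data")
trace.buffer()
trace.buffer()
print(len(fetches))  # -> 1: the expensive fetch happened only once
```

Since the trace buffer can only change while the process is running, caching per stop avoids re-transferring potentially large buffers on every dump command.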

When multiple threads are traced, each one will have its own trace
buffer, as sharing one buffer in multiple threads requires knowing
when each context switch happened so that the decoded trace can be
split correctly among threads. This is beyond the scope of the initial
version of this project.

### $ trace save /path/to/file.json [--copy-images]

This creates a trace bundle, with the settings saved in the given JSON
file, for the current process. By default, it doesn’t create any copy
of the images loaded by the process, unless the “--copy-images”
parameter is specified. That parameter is useful for analyzing the
trace on a machine other than the one where it was captured.

# Remote Protocol Changes

No remote protocol changes are required, as
[https://reviews.llvm.org/D33674](https://reviews.llvm.org/D33674) and
[https://reviews.llvm.org/rG307db0f8974d1b28d7b237cb0d50895efc7f6e6b](https://reviews.llvm.org/rG307db0f8974d1b28d7b237cb0d50895efc7f6e6b)
already introduced them some years ago.

# Build Requirements

In order to build LLDB with this support, it has to be linked against
a build of libipt
([https://github.com/intel/libipt](https://github.com/intel/libipt)),
which is the decoder library.

# Operating System Requirements for Collection/Tracing

Collection can only be done on Linux, and only if the file
/sys/bus/event_source/devices/intel_pt/type exists. The logic gating
this feature is already checked in and defined in
[https://reviews.llvm.org/D33674](https://reviews.llvm.org/D33674).

# Testing

It’s fortunately straightforward to test this feature. It’s possible
to capture traces with perf or with the future “trace start” / “trace
save” commands, and to create trace bundles with their corresponding
settings .json file. Analyzing those traces should give the same
results on any machine, making testing deterministic.
[https://reviews.llvm.org/D85705](https://reviews.llvm.org/D85705) and
its descendants already implement some deterministic tests.

Hi Walter & Greg,

Thanks for sharing this RFC, and for your work in this area.

Is it possible to decode a small portion of an Intel PT trace file quickly, say, in a few milliseconds? This would be useful if tracing were done in ringbuffer mode, or if the event the user is interested in debugging (along with its relevant execution history) is known to occur at the end of the trace. The user could potentially choose which subset of the trace to decode, and re-decode a different subset if more context is needed.

What mechanisms are available for discerning the root cause of a gap? Does the Intel PT decoder have internal consistency checks that can diagnose hardware bugs (or decoder bugs, for that matter)?

Also, when a gap occurs, perhaps it’s possible that the instructions leading up to the gap are not accurate. E.g., if the decoding process desyncs from the trace file while disassembling, it’s possible to accidentally follow (or ignore) a branch. Are there measures to detect/erase those inaccurate instructions prior to a gap?

Also, how should a gap be represented in the debugger output? E.g., if a gap is encountered while dumping instructions, should the debugger print <gap: instruction unknown>?

Imho it’s important to nail down a user interface metaphor for navigating/exploring a trace before adding any ‘dump’-like commands. I don’t think we’ve done that yet.

I’m not trying to hold up work: I think these ‘dump’ subcommands can be hidden, or maybe they could print a ‘for lldb developers only’ warning until we have a better idea of how users will want to explore a trace.

One potential UI metaphor is a slider: the user can see where (which instruction index) in the decoded trace instruction stream they are, and they can move the slider (jump backwards/forwards in the instruction stream) as desired. Wherever they are stopped, they can get an accurate backtrace, look at the call (or line, or instruction-level) execution history, peek ahead at future calls, etc. (Reverse) stepping/continuing could be seen as moving the slider more or less quickly. Maybe it’d be useful to mark a spot to get back to it later.

I’m sure there are other ways to look at a trace. E.g. you could have a view that shows how often each function/line is executed, or you could have an annotated CFG view.

(Stepping back a bit – I realize these comments are somewhat forward-looking / potentially out of scope for your initial patches. Still, I feel it’s worth thinking about early on.)

All this sounds good to me with the caveat that, as mentioned above, we probably should indicate to users that the trace dump facility is not stable / likely to change.


vedant

Thanks for your comments, I’ll reply here:

Is it possible to decode a small portion of an Intel PT trace file quickly, say, in a few milliseconds? This would be useful if tracing were done in ringbuffer mode, or if the event the user is interested in debugging (along with its relevant execution history) is known to occur at the end of the trace. The user could potentially choose which subset of the trace to decode, and re-decode a different subset if more context is needed.

Yes, it’s totally possible. Let me dig a little deeper into the Intel PT trace structure. A trace is made of a bunch of packets, some of which are synchronization packets (PSB packets). You can pick an arbitrary sync packet and start decoding from that point on. This means that it’s possible to decode backwards (i.e. find the last sync packet, decode the packets up to the end of the trace, then move to the previous sync packet, decode up to the former sync packet, and so on). This is very valuable and I’ll keep it in mind for the implementation.
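The backwards-decoding scheme described above can be sketched in a few lines. This is a toy model: the "PSB" markers and payload strings stand in for real Intel PT packets, and the function name is invented.

```python
# Toy packet stream: "PSB" sync packets interleaved with branch packets.
STREAM = ["PSB", "p1", "p2", "PSB", "p3", "PSB", "p4", "p5"]

def segments(stream):
    """Split the packet stream at each sync (PSB) packet; each segment
    is independently decodable starting from its sync point."""
    segs, current = [], None
    for pkt in stream:
        if pkt == "PSB":
            if current is not None:
                segs.append(current)
            current = []
        elif current is not None:
            current.append(pkt)
    if current is not None:
        segs.append(current)
    return segs

# Decode most-recent-first: walk the segments in reverse order, so the
# user can inspect the end of the trace without decoding all of it.
for seg in reversed(segments(STREAM)):
    print(seg)
```

Because each segment decodes independently from its sync point, only the segments the user actually inspects need to be decoded, which is what makes quickly decoding just the tail of a large trace feasible.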

What mechanisms are available for discerning the root cause of a gap? Does the Intel PT decoder have internal consistency checks that can diagnose hardware bugs (or decoder bugs, for that matter)?

Yes. Whenever there’s a decoding error, the libipt decoder notifies us of what the error is. You can check the DecodeInstructions function in https://reviews.llvm.org/D87589 if you are interested, although it’s not a light read.

Also, when a gap occurs, perhaps it’s possible that the instructions leading up to the gap are not accurate. E.g., if the decoding process desyncs from the trace file while disassembling, it’s possible to accidentally follow (or ignore) a branch. Are there measures to detect/erase those inaccurate instructions prior to a gap?

I don’t think this can happen. When an instruction can’t be decoded, the decoder moves to the next synchronization point and resumes decoding from there. Interestingly, it’s possible to configure how often synchronization packets are produced. IIRC you could even request one sync packet per CPU cycle, leading to small gaps. If this configuration is not specified, the CPU itself decides when to produce these packets, which tends to be every few KB of data.

Also, how should a gap be represented in the debugger output? E.g., if a gap is encountered while dumping instructions, should the debugger print <gap: instruction unknown>?

Imho it’s important to nail down a user interface metaphor for navigating/exploring a trace before adding any ‘dump’-like commands. I don’t think we’ve done that yet.

We definitely should let the user know of these events. When dumping traces in this WIP diff https://reviews.llvm.org/D87730, I’m already showing the user these gaps and a reason why they failed.

[4] 0x400529 <+28>: cmpl   $0x3, -0x8(%rbp)
[3] error -13. 'no memory mapped at this address'
[2] 0x40052d <+32>: jle    0x400521

There’s nothing more useful to do besides showing this information somehow. The information is lost, so I think it’s fine as it is :)

However, the really interesting point to discuss is how we could implement reverse debugging under these circumstances. What if you do reverse-next and the trace has a gap, but a bunch of instructions before that gap? Should we abort the reverse-next and tell the user that there’s a gap, so that the user then has to move backwards some other way if that’s their intention? Or should we just move backwards, skipping the gap and printing a message that there’s one? I’d probably choose the latter over the former, but I imagine some people would prefer the first. I’d prefer to leave this for a future discussion. As a side note, several IDEs like VSCode already support reverse-debugging controls, so it would make sense for the default behavior of the LLDB implementation to follow those controls, and to create some other commands for the folks who want something different.

I’m not trying to hold up work: I think these ‘dump’ subcommands can be hidden, or maybe they could print a ‘for lldb developers only’ warning until we have a better idea of how users will want to explore a trace.

I envision these dump commands as the most inefficient way to explore a trace, and I wouldn’t add much more to them. I think that the best way to explore is with reverse debugging (e.g. place a breakpoint, do reverse-continue, stop at that breakpoint, move forward and backwards, print the stack trace, move to another breakpoint, etc.) A trace has so much information but the user already has an idea of where they want to look at when root causing a bug, so breakpoints are the easiest interface for the user to tell LLDB what they are interested in.

One potential UI metaphor is a slider: the user can see where (which instruction index) in the decoded trace instruction stream they are, and they can move the slider (jump backwards/forwards in the instruction stream) as desired. Wherever they are stopped, they can get an accurate backtrace, look at the call (or line, or instruction-level) execution history, peek ahead at future calls, etc. (Reverse) stepping/continuing could be seen as moving the slider more or less quickly. Maybe it’d be useful to mark a spot to get back to it later.

We are agreeing on this. I think that reverse debugging controls in the UI are a very good start for that.

I’m sure there are other ways to look at a trace. E.g. you could have a view that shows how often each function/line is executed, or you could have an annotated CFG view.

Yes! This becomes highly important, especially if there’s timing information associated. You could make visualizations over time, with callstacks, statistics, etc. My intention is to eventually flesh out those cool features.

Thanks,

Walter Erquinigo

Hi Walter,

I’ve only done a brief scan of the document but, in general, I’m favorable of the goals, aim, and approach. Something I think would be good would be to compare/contrast against rr as an “exploring alternatives” section of the document. I think the document should also be made available/adapted to be part of the documentation on “why lldb is implementing this feature/what it can be used for/why”.

Thanks so much for starting this and looking forward to the work and collaboration.

-eric

Same. I am really excited that this work will open up possibilities for
reverse debugging, which is the most important factor impeding me from
migrating (from gdb) to lldb :)

For unit tests, a JSON-format tracing record is probably convenient,
but for practical usage we may need a more compact format, e.g. the
Cap'n Proto format used by rr (see "Stabilizing The rr Trace Format
With Cap'n Proto" on the Eyes Above The Waves blog).
I hope the framework can be easily adapted to such a compact format.

Hi Eric, thanks for the feedback

Something I think would be good would be to compare/contrast against rr as an “exploring alternatives” section of the document.

I’ll include that. I’ve done some comparative research on rr and I think I can provide valuable input.

I think the document should also be made available/adapted to be part of the documentation on “why lldb is implementing this feature/what it can be used for/why”.

I think this information is scattered throughout the document, but I’ll make sure to answer this in one of the first paragraphs.

Thanks!

Walter

Thanks for your feedback, Fangrui. I’ve just been checking out Cap'n Proto and it looks really good. I’ll keep it in mind in the design and see how it can optimize the overall data transfer.

Walter

Thank you for writing this Walter. I think this document will be a
useful reference both now and in the future.

The part that's not clear to me is the story with multi-process
traces. The file format enables those, but it's not clear how they are
going to be created or used. Can you elaborate more on what you intend
to use those for?

The main reason I am asking that is because I am thinking about the
proposed command structure. I'm wondering if it would not be better to
fit this into the existing target/process/thread commands instead of
adding a new top-level command. For example, one could imagine the
following set of commands:

- "process trace start" + "thread trace start" instead of "thread trace
[tid]". That would be similar to "process continue" + "thread continue".
- "thread trace dump [tid]" instead of "trace dump [-t tid]". That would
be similar to "thread continue" and other thread control commands.
- "target create --trace" instead of "trace load". (analogous to target
create --core).
- "process trace save" instead of "trace save" -- (mostly) analogous to
"process save-core"

I am thinking this composition may fit better into the existing lldb
command landscape, though I also see the appeal in grouping everything
trace-related under a single top-level command. What do you think?

The main place where this idea breaks down is the multi-process traces.
While we could certainly make "target create --trace" create multiple
targets, that would be fairly unusual. OTOH, the whole concept of having
multiple targets share something is a pretty unusual thing for lldb.
That's why I'd like to hear more about where you want to go with this idea.

Hi Pavel, thanks for the comments. I’ll reply inline

The part that’s not clear to me is the story with multi-process
traces. The file format enables those, but it’s not clear how they are
going to be created or used. Can you elaborate more on what you intend
to use those for?

Something we are doing at Facebook is having a global Intel PT collector that can trace all processes of a given machine for some seconds. This can produce a multi-process trace. I imagine these traces won’t ever be generated by LLDB itself, though. Having one single JSON trace file makes sharing the trace easier. Multi-process tracing is also something you can do with the perf tool, so it’s not uncommon.
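As a purely illustrative sketch (the field names here are hypothetical, not the actual schema being proposed), such a multi-process JSON trace file might look like this: a small metadata file that points at the raw per-thread trace buffers and lists the modules needed for symbolication.

```json
{
  "trace": {
    "type": "intel-pt",
    "cpuInfo": { "vendor": "GenuineIntel", "family": 6, "model": 85 }
  },
  "processes": [
    {
      "pid": 1234,
      "triple": "x86_64-unknown-linux-gnu",
      "threads": [
        { "tid": 1234, "traceFile": "thread-1234.trace" }
      ],
      "modules": [
        { "file": "/usr/bin/myapp", "loadAddress": "0x400000" }
      ]
    },
    {
      "pid": 5678,
      "triple": "x86_64-unknown-linux-gnu",
      "threads": [
        { "tid": 5678, "traceFile": "thread-5678.trace" }
      ],
      "modules": []
    }
  ]
}
```

The metadata stays small even for enormous traces, since the bulk of the data lives in the referenced binary trace files.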

There are some technical details that are worth mentioning as well. Intel PT offers two main modes of tracing: single thread tracing and logical CPU tracing.

  • The first one is the easiest to implement, but it requires a dedicated buffer per thread, which can consume too much RAM if thousands of threads are traced. It also adds a small performance cost, as the kernel disables and enables tracing on every context switch.
  • The other mode, logical CPU tracing, traces all the activity on one logical core using one single buffer. It is also more performant, as the kernel doesn’t disable tracing intermittently. Sadly, that trace contains no information about context switches, so a separate context-switch trace is used to split this big trace into per-thread subtraces. The decoder we are implementing will eventually be able to do this splitting, and it will need to be fed with the information of all processes. This is another reason why allowing multi-process traces is important.
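The per-thread splitting described above can be sketched in Python. The record formats here are invented for illustration; the real Intel PT packets and context-switch records are binary formats handled by a decoder library such as libipt.

```python
# Hypothetical sketch: split a single logical-CPU trace into per-thread
# subtraces using a separate context-switch trace. Record formats are
# invented for illustration only.
def split_by_thread(cpu_trace, switches):
    """cpu_trace: list of (timestamp, packet) pairs from one logical core.
    switches: sorted list of (timestamp, tid) context-switch records,
    meaning `tid` starts running on this core at `timestamp`."""
    per_thread = {}
    current_tid = None
    si = 0
    for ts, packet in cpu_trace:
        # Advance to the context-switch record in effect at time `ts`.
        while si < len(switches) and switches[si][0] <= ts:
            current_tid = switches[si][1]
            si += 1
        if current_tid is not None:
            per_thread.setdefault(current_tid, []).append(packet)
    return per_thread
```

For example, `split_by_thread([(1, "a"), (2, "b"), (5, "c")], [(0, 100), (4, 200)])` attributes packets "a" and "b" to tid 100 and packet "c" to tid 200. The real decoder does this at the packet-stream level, but the bookkeeping is the same idea.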

Regarding the commands structure, I’d prefer to keep it under “trace” for now, because of the multi-process case and because we still have no users that can report feedback. Very soon we’ll start building some tools around this feature, so we’ll have more concrete experiences to share. Then it’ll be good to sync up and revisit the structure.

Btw, the gdb implementation of this kind of tracing is under the “record” main command (https://sourceware.org/gdb/current/onlinedocs/gdb/Process-Record-and-Replay.html). I think this allows for some flexibility, as each trace plugin has different characteristics.

I’m not sure how Cap’n Proto comes into play here. The way I understand it, the real data is contained in a separate file in the specialized intel format and the json is just for the metadata. I’d expect the metadata file to be small even for enormous traces, so I’m not sure what’s to be gained by optimizing it.

I didn’t mention it in that email, but there is some additional information that we’ll eventually include in the traces, like the context-switch trace I mentioned above. I think that we could probably use Cap’n Proto for cases like this. We might also not use it at all as well, but it was nice to learn about it and keep it in mind.

Thanks,
Walter

Thank you for writing this Walter. I think this document will be a
useful reference both now and in the future.

The part that's not clear to me is the story with multi-process
traces. The file format enables those, but it's not clear how they are
going to be created or used. Can you elaborate more on what you intend
to use those for?

Mainly for system trace kinds of things where an entire system gets traced.

The main reason I am asking that is because I am thinking about the
proposed command structure. I'm wondering if it would not be better to
fit this into the existing target/process/thread commands instead of
adding a new top-level command. For example, one could imagine the
following set of commands:

- "process trace start" + "thread trace start" instead of "thread trace
[tid]". That would be similar to "process continue" + "thread continue".
- "thread trace dump [tid]" instead of "trace dump [-t tid]". That would
be similar to "thread continue" and other thread control commands.
- "target create --trace" instead of "trace load". (analogous to target
create --core).
- "process trace save" instead of "trace save" -- (mostly) analogous to
"process save-core"

I am thinking this composition may fit better into the existing lldb
command landscape, though I also see the appeal in grouping everything
trace-related under a single top-level command. What do you think?

The main place where this idea breaks down is the multi-process traces.
While we could certainly make "target create --trace" create multiple
targets, that would be fairly unusual. OTOH, the whole concept of having
multiple targets share something is a pretty unusual thing for lldb.
That's why I'd like to hear more about where you want to go with this idea.

I kind of see tracing as having two sides:
1 - post mortem tracing for individual or multiple processes
2 - live debug session tracing for being able to see how you crashed where trace data is for current process only

For post-mortem tracing, the trace top-level command seemed to make sense here because there are no other target commands that act on more than one target. So "trace load" makes sense to me here for loading one or more traces. The idea is that the trace JSON file has enough info to completely load up the state of the trace so we can symbolicate, dump and step around in history. So I would vote to keep "trace load" at the very least because it can create one or more targets. Options can be added to display the processes if needed:

(lldb) trace list <trace-json-file>

But we could move "trace dump" over into "target trace dump" or "process trace dump" since that is effectively how we are coding these patches.

For live debugging where we gather trace data through the process plug-in, we will have a live process that may or may not have trace data. If tracing isn't available we will not be able to dump anything. But I would like to see process/thread commands for this scenario:

- process trace start/stop (only succeeds if we can gather trace data through the process plug-in)
- thread trace start/stop (which can succeed only if current tracing can enable tracing for only one thread)

Not sure if we need "process trace save" or "thread trace save" as the saving can be done as an option to "process trace stop --save /path/to/save"

So I am all for fitting these commands in where they need to go.

After a chat with Greg, we agreed on this set of commands

trace load /path/to/json
process trace start/stop
process trace save /path/to/json
thread trace start/stop
thread trace dump [instructions | functions]
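To illustrate, a session using the agreed commands might look roughly like this (the output and paths are made up; this is just a sketch of the intended workflow, not actual lldb output):

```
(lldb) process trace start              # live: begin tracing via the process plug-in
(lldb) thread trace start               # or trace only the current thread
(lldb) thread trace dump instructions   # inspect the reconstructed instruction history
(lldb) process trace save /path/to/trace.json
(lldb) process trace stop

(lldb) trace load /path/to/trace.json   # post-mortem: creates one or more targets
```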

We spoke a bit after Pavel’s comments, which made sense, and we propose the commands Walter sent below. Let us know what everyone thinks of this organization of the command structure!

I had accepted the patch https://reviews.llvm.org/D86670, but then marked it as “Request Changes” while we discuss the commands in this RFC after new comments came in.

Thanks. The new commands look good to me.

The multi-process trace concept is interesting. I don't question its
usefulness -- I am sure it can be useful for various kinds of analysis
(though I've never used that myself). I am wondering though about how to
represent this thing in lldb, as we don't really have anything close to
the concept of "debugging" all processes on a given system.

The only thing that comes close is probably the kernel-level debugging.
One idea (which has just occurred to me, so it may not be good) might be
to make these traces behave similarly to that. I.e., create a single
target/process with one "thread" per physical cpu, and then have a
special "os plugin" like thing which would present individual
process/threads.
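As a rough sketch of that idea, assuming the Python OperatingSystem plug-in interface demonstrated in lldb/examples/python/operating_system.py, such a plug-in could synthesize one lldb thread per traced process/thread pair. The (pid, tid) values below are placeholders for whatever the trace decoder would recover.

```python
# Hypothetical sketch of an lldb OperatingSystem plug-in presenting the
# processes/threads of a multi-process trace. Not a working implementation;
# register contexts and real trace lookup are omitted.
class TraceOperatingSystemPlugIn:
    def __init__(self, process):
        # `process` would be the single "system" target/process owning the trace.
        self.process = process

    def get_thread_info(self):
        # One thread dict per traced (pid, tid) pair. Encoding the pid into
        # the synthetic lldb tid keeps threads from different processes from
        # colliding.
        traced = [(100, 100), (100, 101), (200, 200)]  # placeholder data
        return [
            {
                "tid": (pid << 32) | tid,
                "name": "pid %d, tid %d" % (pid, tid),
                "state": "stopped",
                "stop_reason": "trace",
            }
            for pid, tid in traced
        ]
```

This keeps the one-trace-one-target invariant while still letting the user see which synthetic thread belongs to which original process.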

That would have the advantage of maintaining the one trace-one target
invariant and also would preserve the information about relative timings
of individual "processes". I think that would be an interesting way to
view these things, but I don't know if it would be the best one...

pl

After a chat with Greg, we agreed on this set of commands

trace load /path/to/json
process trace start/stop
process trace save /path/to/json
thread trace start/stop
thread trace dump [instructions | functions]

Thanks. The new commands look good to me.

Great, we can move the “trace dump” over to “thread trace dump” for https://reviews.llvm.org/D86670 and keep that moving.

The multi-process trace concept is interesting. I don’t question its
usefulness – I am sure it can be useful for various kinds of analysis
(though I’ve never used that myself). I am wondering though about how to
represent this thing in lldb, as we don’t really have anything close to
the concept of “debugging” all processes on a given system.

The only thing that comes close is probably the kernel-level debugging.
One idea (which has just occurred to me, so it may not be good) might be
to make these traces behave similarly to that. I.e., create a single
target/process with one “thread” per physical cpu, and then have a
special “os plugin” like thing which would present individual
process/threads.

I don’t know enough about how trace data is stored or annotated after the raw data is pulled from the cores, but to make it useful it must be possible to associate it with processes and threads somehow; otherwise it would just be a bunch of addresses that overlap between many processes.

That would have the advantage of maintaining the one trace-one target
invariant and also would preserve the information about relative timings
of individual “processes”. I think that would be an interesting way to
view these things, but I don’t know if it would be the best one…

I might suggest that each trace plug-in should do its best to represent processes and threads as separate entities so that they all remain distinct. Whatever form the data starts out in should be abstracted: I would rather see individual processes with their threads if that is possible, though I am thinking about this with only a bit of knowledge of trace data. Many chip makers design these trace formats from a “trace a core” perspective, but if we can tame this data and present it as users would want to see it, instead of representing it as it is stored, I think we will have a compelling trace feature in our debugger.

Greg