[RFC] Fast Conditional Breakpoints (FCB)

mib · August 14, 2019, 8:52pm

Hi everyone,

I’m Ismail, a compiler engineer intern at Apple. As a part of my internship,
I'm adding Fast Conditional Breakpoints to LLDB, using code patching.

Currently, the expressions that power conditional breakpoints are lowered
to LLVM IR and LLDB knows how to interpret a subset of it. If that fails,
the debugger JIT-compiles the expression (compiled once, and re-run on each
breakpoint hit). In both cases LLDB must collect all program state used in
the condition and pass it to the expression.

The goal of my internship project is to make conditional breakpoints faster by:

1. Compiling the expression ahead of time, when the breakpoint is set, and
injecting it into the inferior's memory only once.
2. Re-routing the inferior's execution flow to run the expression and check,
in-process, whether it needs to stop.

This saves the cost of context-switching between the debugger and the
inferior program (about 10 times) to compile and evaluate the condition.

This feature is described on the [LLDB Project page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
The goal is to have it working for most languages and architectures
supported by LLDB. However, my initial implementation will be for C-based
languages targeting x86_64. It will be extended to AArch64 afterwards.

Note that the way my prototype is implemented makes it fully extensible to
other languages and architectures.

## High Level Design

Every time a breakpoint that holds a condition is hit, multiple context
switches are needed in order to compile and evaluate the condition.

First, the breakpoint is hit and control is given to the debugger.
That's where LLDB wraps the condition expression into a UserExpression that
will get compiled and injected into the program memory. Another round-trip
between the inferior and LLDB is needed to run the compiled expression
and extract the expression result that tells LLDB whether to stop.

To get rid of those context switches, we will evaluate the condition inside
the program, and only stop when the condition is true. LLDB will achieve this
by inserting a jump from the breakpoint address to a code section that will
be allocated in the program memory. That code will save the thread state, run the
condition expression, restore the thread state and then execute the copied
instruction(s) before jumping back to the regular program flow.
Then we only trap and return control to LLDB when the condition is true.

## Implementation Details

To be able to evaluate a breakpoint condition without interacting with the
debugger, LLDB changes the inferior program execution flow by overwriting
the instruction at which the breakpoint was set with a branching instruction.

The original instruction(s) are copied to a memory stub allocated in the
inferior program memory called the __Fast Conditional Breakpoint Trampoline__
or __FCBT__. The FCBT will allow us to re-route the program execution flow to
check the condition in-process while preserving the original program behavior.
This part is critical to setting up Fast Conditional Breakpoints.

          Inferior Binary                                     Trampoline

    |            .            |                      +-------------------------+
    |            .            |                      |                         |
    |            .            |           +--------->+   Save RegisterContext  |
    |            .            |           |          |                         |
    +-------------------------+           |          +-------------------------+
    |                         |           |          |                         |
    |       Instruction       |           |          |  Build Arguments Struct |
    |                         |           |          |                         |
    +-------------------------+           |          +-------------------------+
    |                         +-----------+          |                         |
    |   Branch to Trampoline  |                      |  Call Condition Checker |
    |                         +<----------+          |                         |
    +-------------------------+           |          +-------------------------+
    |                         |           |          |                         |
    |       Instruction       |           |          | Restore RegisterContext |
    |                         |           |          |                         |
    +-------------------------+           |          +-------------------------+
    |            .            |           |          |                         |
    |            .            |           +----------+ Run Copied Instructions |
    |            .            |                      |                         |
    |            .            |                      +-------------------------+

Once the execution reaches the Trampoline, several steps need to be taken.

LLDB relies on its UserExpressions to JIT more complex conditional
expressions. However, since the execution will be handled by the debugged
program, LLDB will generate some code ahead of time in the Trampoline that
will allow the inferior to initialize the expression's argument structure.

Generating the condition checker, as well as the code that initializes
the argument structure on each breakpoint hit, is handled by the
__BreakpointInjectedSite__ class, which builds the conditional expression for
all the BreakpointLocations, emits the `$__lldb_expr` function, and relocates
variables in the `$__lldb_arg` structure.

BreakpointInjectedSites are created in the __Process__ if the user enables
the `-I | --inject-condition` flag when setting or modifying a breakpoint.
Because the __FCBT__ is architecture specific, BreakpointInjectedSites will
only be available when a target has added support for it in the matching
Architecture Plugin.

Several parts of LLDB have to be modified to implement this feature:

- **Breakpoint**: Added BreakpointInjectedSite, and helper functions to the
  related classes (Breakpoint, BreakpointLocation, BreakpointSite,
  BreakpointOptions).
- **Plugins**: Added ObjectFileTrampoline for the unwinding. Added x86_64 ABI
  support (FCBT setup & safety checks).
- **Symbol**: Changed `FuncUnwinders` and `UnwindPlan` to support FCBT.
- **Target**: Added BreakpointInjectedSite creation to `Process` to insert
  the jump to the FCBT. Added the Trampoline module creation to `ABI` for the
  unwinding.

### Breakpoint Option

Since Fast Conditional Breakpoints are still under development, they will not
be on by default. Instead, we will provide a flag to `breakpoint set` and
`breakpoint modify` to enable the feature. Note that the end goal is to have
them on by default and only fall back to regular conditional breakpoints on
unsupported architectures.

They can be enabled with the `-I | --inject-condition` option. The feature
can also be enabled through the Python Scripting Bridge public API, using the
`InjectCondition(bool enable)` method on an __SBBreakpoint__ or
__SBBreakpointLocation__ object.

This feature is intended to be used with condition expressions
(`-c <expr> | --condition <expr>`), but also with other condition types such as:

- Thread ID (`-t <thread-id> | --thread-id <thread-id>`)
- Thread Index (`-x <thread-index> | --thread-index <thread-index>`)
- Thread Queue Name

### Trampoline

To be able to inject the condition, we need to re-route the debugged program's
execution flow. This part is handled in the __Trampoline__, a memory stub
allocated in the inferior that will contain the condition check while
preserving the program's original behavior.

The trampoline is architecture specific and built by LLDB. To have the
condition evaluation work out-of-place, several steps need to be completed:

1. Save all the registers by pushing them to the stack.
2. Build the `$__lldb_arg` structure by calling an injected UtilityFunction.
3. Check the condition by calling the injected UserExpression, and execute a
   trap if the condition is true.
4. Restore the register context.
5. Rewrite the operands of the copied original instructions and run them.

All the values needed for these steps can be computed ahead of time, when the
breakpoint is set (e.g. size of the allocation, jump address, relocation ...).

Since the x86_64 ISA has variable instruction sizes, LLDB moves enough
instructions into the trampoline to be able to overwrite them with a jump to
the trampoline. Also, the allocation region for the trampoline might be too far
away for a single jump, so we might need several branch islands before
reaching the trampoline (WIP).
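
To make the size and reach constraints concrete, here is a minimal sketch (C++17) and not the prototype's code. It assumes the branch is the 5-byte x86_64 near jump (`0xE9` followed by a 32-bit displacement); the RFC does not say which encoding is used. `EncodeJmpRel32` and `BytesToRelocate` are hypothetical helper names, and the instruction lengths are assumed to come from a disassembler.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Assumption: the branch is an x86_64 `jmp rel32` (opcode 0xE9 + 4 bytes).
constexpr std::size_t kJmpRel32Size = 5;

// Encode a `jmp rel32` located at `from` that lands on `to`. Returns nothing
// when the target is out of reach of a 32-bit displacement, which is the
// situation where a branch island would be needed.
std::optional<std::array<std::uint8_t, kJmpRel32Size>>
EncodeJmpRel32(std::uint64_t from, std::uint64_t to) {
  // The displacement is relative to the address of the next instruction.
  const std::int64_t disp =
      static_cast<std::int64_t>(to - (from + kJmpRel32Size));
  if (disp < std::numeric_limits<std::int32_t>::min() ||
      disp > std::numeric_limits<std::int32_t>::max())
    return std::nullopt;

  const std::uint32_t bits =
      static_cast<std::uint32_t>(static_cast<std::int32_t>(disp));
  std::array<std::uint8_t, kJmpRel32Size> bytes{0xE9, 0, 0, 0, 0};
  for (unsigned i = 0; i < 4; ++i) // x86 stores the displacement little-endian
    bytes[1 + i] = static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF);
  return bytes;
}

// Given the lengths of the instructions starting at the breakpoint address,
// return how many bytes must be moved to the trampoline so that only whole
// instructions are displaced and the jump fits. Returns 0 if it never fits.
std::size_t BytesToRelocate(const std::vector<std::size_t> &insn_lengths) {
  std::size_t total = 0;
  for (std::size_t len : insn_lengths) {
    total += len;
    if (total >= kJmpRel32Size)
      return total;
  }
  return 0;
}
```

For instance, with instruction lengths 1, 3 and 4, all three instructions (8 bytes) have to move, because the first two only free 4 bytes.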

### BreakpointInjectedSite

To handle the Fast Conditional Breakpoint setup, LLDB uses
__BreakpointInjectedSite__, a subclass of the BreakpointSite class.
A BreakpointInjectedSite uses different `UserExpression`s to resolve variables
and inject the condition checker.

#### Condition Checker

Because a BreakpointSite can have multiple BreakpointLocations with different
conditions, LLDB first needs to iterate over each owner of the BreakpointSite
and gather all the conditions. If one of the BreakpointLocations doesn't have a
condition, or the condition is not set to be injected, the
BreakpointInjectedSite will behave as a regular BreakpointSite.
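
The fallback rule above can be sketched as follows. This is a simplified model, not LLDB's API: `LocationModel` and `CanInjectSite` are hypothetical names, and treating a site with no owners as non-injectable is an assumption of this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a BreakpointLocation.
struct LocationModel {
  std::string condition; // empty when the location has no condition
  bool inject_condition; // true when the condition is set to be injected
};

// The site can take the injected path only if every owner has a condition
// that is marked for injection; otherwise it acts as a regular site.
bool CanInjectSite(const std::vector<LocationModel> &owners) {
  return !owners.empty() &&
         std::all_of(owners.begin(), owners.end(),
                     [](const LocationModel &loc) {
                       return !loc.condition.empty() && loc.inject_condition;
                     });
}
```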

Once all the conditions are fetched, LLDB will create a __UserExpression__
with the injected trap instruction.

When a trap is hit, LLDB uses the __BreakpointSiteList__, a map from a trap
address to a BreakpointSite, to identify where to stop. To allow LLDB to catch
the injected trap at runtime, it will disassemble the compiled expression and
scan for the trap address. The injected trap address is then added to LLDB's
__BreakpointSiteList__.

When generated, this is what the condition checker looks like:

    void $__lldb_expr(void *$__lldb_arg)
    {
        /*lldb_BODY_START*/
        if (condition) {
            __builtin_debugtrap();
        };
        /*lldb_BODY_END*/
    }

#### Argument Builder

The conditional expression will often refer to local variables, and the
references to these variables need to be tied to the instances of them in the
current frame.

Usually the expression evaluator invokes the __Materializer__, which fetches
the variables' values and fills the `$__lldb_arg` structure. But since we don't
want to switch contexts, LLDB has to resolve the used variables by generating
code that will initialize the `$__lldb_arg` pointer before running the
condition checker.

That's where the __Argument Builder__ comes in.

The argument builder uses a `UtilityFunction` to generate the
`$__lldb_create_args_struct` function. It is called by the Trampoline
before the condition checker, in order to resolve variables used in the
condition expression.

`$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:

1. It takes advantage of the fact that LLDB saved all the registers to the
stack and maps them to a `register_context` structure.

    ```cpp
    typedef struct {
    // General Purpose Registers
    } register_context;
    ```

2. Using information from the variable resolver, it allocates a memory stub
   that will contain the used variable addresses.
3. Then, it will use the register values and the collected metadata to
   compute the used variable address and write that into the
   newly allocated structure.
4. Finally the allocated structure is returned to the trampoline, which will
   pass it as an argument to the injected condition checker.

Since `$__lldb_create_args_struct` uses the same JIT Engine as the
UserExpression, LLDB will parse, build and insert it in the program memory.
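
As an illustration of steps 2 and 3, here is a minimal sketch, not the generated `$__lldb_create_args_struct`. It assumes every variable is described by a frame-base offset and that the frame base is the saved `rbp`. `register_context_model`, `ArgumentMetadataModel` and `CreateArgsStruct` are hypothetical names, and a `std::vector` stands in for the memory stub allocated in the inferior.

```cpp
#include <cstdint>
#include <vector>

// Assumed layout: only the register this sketch needs. The real structure
// would map every saved general purpose register.
struct register_context_model {
  std::uintptr_t rbp;
};

// Hypothetical per-variable metadata collected when the breakpoint is set.
struct ArgumentMetadataModel {
  std::int64_t fbreg_offset; // signed offset from the frame base
};

// Compute one address per used variable from the saved registers. The
// returned vector plays the role of the structure handed to the checker.
std::vector<std::uintptr_t>
CreateArgsStruct(const register_context_model &regs,
                 const std::vector<ArgumentMetadataModel> &vars) {
  std::vector<std::uintptr_t> args;
  args.reserve(vars.size());
  for (const ArgumentMetadataModel &var : vars)
    args.push_back(regs.rbp + static_cast<std::uintptr_t>(var.fbreg_offset));
  return args;
}
```

The condition checker would then dereference each stored address to read the variable, so no value needs to be copied by the debugger.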

#### Variable Resolver

When creating a Fast Conditional Breakpoint, the __debug info__ tells us
where the used variables are located. Using this information and the saved
register context, we can generate code that will resolve the variables at
runtime (__Step 3 of the Argument Builder__).

LLDB will first get the `DeclMap` from the condition UserExpression and pull a
list of the used variables. While iterating on that list, LLDB extracts each
variable's __DWARF Expression__.

A DWARF expression describes how to reconstruct a variable's value or
location using DWARF operations.

LLDB needs the register context because local variables are
often at an offset from the __Stack Base Pointer register__ or spread across
one or more registers. This is why I've only focused on `DW_OP_fbreg`
expressions so far: I can get the offset of the variable and add it to the
base pointer register to get its address. The variable address, and other
metadata such as its size, its identifier and the DWARF Expression, are saved
to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder`
to create the `$__lldb_arg` structure.
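
For reference, extracting the offset from a `DW_OP_fbreg` expression could look like the sketch below (C++17). In DWARF, `DW_OP_fbreg` is opcode `0x91` followed by a signed LEB128 operand. `DecodeSLEB128` and `FbregOffset` are hypothetical helper names and not LLDB functions; the sketch only looks at the leading operation and ignores anything that follows it.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::uint8_t kDW_OP_fbreg = 0x91;

// Decode a signed LEB128 value starting at data[pos]. Returns nothing if the
// encoding is truncated or longer than 64 bits.
std::optional<std::int64_t> DecodeSLEB128(const std::vector<std::uint8_t> &data,
                                          std::size_t pos) {
  std::uint64_t result = 0;
  unsigned shift = 0;
  while (pos < data.size() && shift < 64) {
    const std::uint8_t byte = data[pos++];
    result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    shift += 7;
    if ((byte & 0x80) == 0) {
      // Bit 6 of the last byte is the sign bit: extend it if it is set.
      if (shift < 64 && (byte & 0x40))
        result |= ~std::uint64_t{0} << shift;
      return static_cast<std::int64_t>(result);
    }
  }
  return std::nullopt;
}

// Return the frame-base offset if the expression starts with DW_OP_fbreg.
std::optional<std::int64_t> FbregOffset(const std::vector<std::uint8_t> &expr) {
  if (expr.size() < 2 || expr[0] != kDW_OP_fbreg)
    return std::nullopt;
  return DecodeSLEB128(expr, 1);
}
```

For example, the two-byte expression `0x91 0x6c` decodes to an offset of -20 from the frame base.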

Since all the registers are already mapped to a structure, I should
be able to support more __DWARF Operations__ in the future.

After collecting some metrics on the __Clang__ binary, built at __-O0__,
the debug info shows that four DWARF operations account for more than
__99%__ of all occurrences:

| DWARF Operation | Occurrences   |
|-----------------|---------------|
| `DW_OP_fbreg`   | 2 114 612     |
| `DW_OP_reg`     | 820 548       |
| `DW_OP_constu`  | 267 450       |
| `DW_OP_addr`    | 17 370        |
| __Top 4__       | __3 219 980__ |
| __Total__       | __3 236 859__ |

Those 4 operations are the ones that I'll support for now.
To support more complex expressions, we would need to JIT-compile
a DWARF expression interpreter.

### Unwinders

When the program hits the injected trap instruction, the execution stops
inside the injected UserExpression.

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
      * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
        frame #1: 0x0000000100105028

This part of the program should be transparent to the user. To allow LLDB to
elide the condition checker and the FCBT frame, the Unwinder needs to be
able to identify all of the frames, up to the user's source code frame.

The injected UserExpression already has a valid stack frame, but it doesn't
have any information about its caller, the Trampoline. In order to unwind to
the user's code, LLDB needs symbolic information for the trampoline.
This information is tied to LLDB modules, created using an ObjectFile
representation, the __ObjectFileTrampoline__ in our case.

It will contain several pieces of information, such as the module's name and
description, but most importantly the module __Symbol Table__ that will have
the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`) and a
__Text Section__ that will tell the unwinder the trampoline bounds.

Then, LLDB inserts a __Function Unwinder__ in the module UnwindTable and
creates an __Unwind Plan__ pointing to the BreakpointLocation return address.
This is done taking into consideration that the trampoline will alter the
memory layout by spilling registers to the stack.
Finally, the newly created module is appended to the target image list, which
allows LLDB to move between the injected code and the user code seamlessly.

This is what the backtrace looks like after hitting the injected trap:

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
        frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
        frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
      * frame #2: 0x0000000100000f5b main`main at main.c:7:23

For now, LLDB selects the user frame but the goal would be to mask all the
frames introduced by the Fast Conditional Breakpoint.

A `debug-injected-condition` setting will make it possible to stop at the FCBT
and show all the elided frames.

### Instruction Shifter (WIP)

Because some instructions might use operands that are at an offset relative
to the program counter, copying the instructions to a new location might
change their meaning.

LLDB needs to patch each such instruction with the right offset.
This is done using LLVM's `MCInst` to detect the instructions
that need to be rewritten.
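
The arithmetic behind that patching can be sketched as follows (C++17). This is a minimal illustration for a 32-bit PC-relative displacement on x86_64 and not the prototype's code; `RebaseRipDisp32` is a hypothetical helper name. Since the target is `address + length + displacement` and the instruction length does not change when it is copied, only the difference between the two addresses matters.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Compute the displacement an instruction needs after being moved from
// `old_addr` to `new_addr` so that it still refers to the same target.
// Returns nothing if the new displacement does not fit in 32 bits, in which
// case the instruction cannot simply be patched.
std::optional<std::int32_t> RebaseRipDisp32(std::uint64_t old_addr,
                                            std::uint64_t new_addr,
                                            std::int32_t old_disp) {
  // target   = old_addr + length + old_disp
  // new_disp = target - (new_addr + length) = old_disp + (old_addr - new_addr)
  const std::int64_t delta = static_cast<std::int64_t>(old_addr - new_addr);

  // Reject moves that are obviously too far, so the sum below cannot overflow.
  constexpr std::int64_t kLimit = std::int64_t{1} << 33;
  if (delta > kLimit || delta < -kLimit)
    return std::nullopt;

  const std::int64_t new_disp = static_cast<std::int64_t>(old_disp) + delta;
  if (new_disp < std::numeric_limits<std::int32_t>::min() ||
      new_disp > std::numeric_limits<std::int32_t>::max())
    return std::nullopt;
  return static_cast<std::int32_t>(new_disp);
}
```

For example, an instruction at `0x1000` with displacement `0x200` that is copied to `0x5000` needs displacement `-0x3E00` to keep pointing at the same address.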

## Risk Mitigation

The optimization relies heavily on code injection, most of which is
architecture specific. Because of this, overwriting the instructions
can fail depending on the breakpoint location, e.g.:

- If the overwritten instructions contain indirection (branch instructions).
- If the overwritten instructions are a branch target.
- If there are not enough instructions to insert the branch instruction (x86_64).

If the setup process fails to insert the Fast Conditional Breakpoint, it will
fall back to the legacy behavior and warn the user about what went wrong.

One way to mitigate those limitations would be to use code instrumentation
to detect whether it's safe to set a Fast Conditional Breakpoint at a certain
location, and hint the user to move the FCB before or after the location where
it was set originally.

## Prototype Code

I submitted my patches ([1](https://reviews.llvm.org/D66248), [2](https://reviews.llvm.org/D66249),
[3](https://reviews.llvm.org/D66250)) on Phabricator with the prototype.

## Feedback

Before moving forward I'd like to get the community's input. What do you
think about this approach? Any feedback would be greatly appreciated!

Thanks,

Finkel_Hal_J · August 14, 2019, 10:42pm

First, this all sounds really useful.

Out of curiosity, how do these statistics change if you compile Clang
with -O1? Many of my users need to debug slightly-optimized code.

-Hal

labath · August 15, 2019, 5:10pm

Hello Ismail, and welcome to LLDB. You have a very interesting (and not entirely trivial) project, and I wish you the best of luck in your work. I think this will be a very useful addition to lldb.

It sounds like you have researched the problem very well, and the overall direction looks good to me. However, I do have some ideas and suggestions about possible tweaks/improvements that I would like to hear your thoughts on. Please find my comments inline.

#### Argument Builder

The conditional expression will often refer to local variables, and the
references to these variables need to be tied to the instances of them in the
current frame.

Usually the expression evaluator invokes the __Materializer__, which fetches
the variables' values and fills the `$__lldb_arg` structure. But since we don't
want to switch contexts, LLDB has to resolve used variables by generating code
that will initialize the `$__lldb_arg` pointer, before running the condition
checker.

That's where the __Argument Builder__ comes in.

The argument builder uses a `UtilityFunction` to generate the
`$__lldb_create_args_struct` function. It is called by the Trampoline
before the condition checker, in order to resolve variables used in the
condition expression.

`$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:

1. It takes advantage of the fact that LLDB saved all the registers to the
   stack and maps them to a `register_context` structure.

    ```cpp
    typedef struct {
    // General Purpose Registers
    } register_context;
    ```

2. Using information from the variable resolver, it allocates a memory stub
   that will contain the used variable addresses.
3. Then, it will use the register values and the collected metadata to
   compute the used variable address and write that into the
   newly allocated structure.
4. Finally the allocated structure is returned to the trampoline, which will
   pass it as an argument to the injected condition checker.

I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the `$__lldb_arg` building code.

Since `$__lldb_create_args_struct` uses the same JIT Engine as the
UserExpression, LLDB will parse, build and insert it in the program memory.

#### Variable Resolver

When creating a Fast Conditional Breakpoint, the __debug info__ tells us
where the used variables are located. Using this information and the saved
register context, we can generate code that will resolve the variables at
runtime (__Step 3 of the Argument Builder__).

LLDB will first get the `DeclMap` from the condition UserExpression and pull a
list of the used variables. While iterating on that list, LLDB extracts each
variable's __DWARF Expression__.

DWARF expressions explain how to reconstruct a variable's values using DWARF
operations.

LLDB needs the register context because local variables are
often at an offset from the __Stack Base Pointer register__ or spread across
one or more registers. This is why I've only focused on `DW_OP_fbreg`
expressions so far: I can get the offset of the variable and add it to the
base pointer register to get its address. The variable address, and other
metadata such as its size, its identifier and the DWARF Expression, are saved
to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder`
to create the `$__lldb_arg` structure.

Since all the registers are already mapped to a structure, I should
be able to support more __DWARF Operations__ in the future.

After collecting some metrics on the __Clang__ binary, built at __-O0__,
the debug info shows that __99%__ of the most used DWARF Operations are :

>DWARF Operation| Occurrences |
>---------------|---------------------------|
>DW\_OP_fbreg | 2 114 612 |
>DW\_OP_reg | 820 548 |
>DW\_OP_constu | 267 450 |
>DW\_OP_addr | 17 370 |

> __Top 4__ | __3 219 980 Occurrences__ |
>---------------|---------------------------|
> __Total__ | __3 236 859 Occurrences__ |

Those 4 operations are the one that I'll support for now.
To support more complex expressions, we would need to JIT-compile
a DWARF expression interpreter.
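
To make the four operations concrete, here is a minimal sketch of how a single-operation location expression could be resolved against a saved frame base. This is not the prototype's code: `Location`, `ResolveSimpleLocation` and the LEB128 helpers are hypothetical names, `DW_OP_addr` is assumed to carry an 8-byte little-endian operand (x86_64), and anything with trailing operations is simply rejected, which is where a real interpreter would be needed. The opcode values are the ones from the DWARF standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// DWARF opcode values (from the DWARF standard).
constexpr uint8_t DW_OP_addr = 0x03;
constexpr uint8_t DW_OP_constu = 0x10;
constexpr uint8_t DW_OP_reg0 = 0x50;
constexpr uint8_t DW_OP_reg31 = 0x6f;
constexpr uint8_t DW_OP_fbreg = 0x91;

// Hypothetical result type: where the variable lives, or its constant value.
enum class LocationKind { Memory, Register, Constant };
struct Location {
  LocationKind kind;
  uint64_t value; // address, DWARF register number, or constant
};

// Decode an unsigned LEB128 operand; returns false on truncated input.
bool ReadULEB128(const std::vector<uint8_t> &b, size_t &pos, uint64_t &out) {
  out = 0;
  for (unsigned shift = 0; shift < 64; shift += 7) {
    if (pos >= b.size()) return false;
    uint8_t byte = b[pos++];
    out |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;
}

// Decode a signed LEB128 operand; returns false on truncated input.
bool ReadSLEB128(const std::vector<uint8_t> &b, size_t &pos, int64_t &out) {
  uint64_t result = 0;
  for (unsigned shift = 0; shift < 64; shift += 7) {
    if (pos >= b.size()) return false;
    uint8_t byte = b[pos++];
    result |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      if (shift + 7 < 64 && (byte & 0x40))
        result |= ~uint64_t{0} << (shift + 7); // sign-extend
      out = static_cast<int64_t>(result);
      return true;
    }
  }
  return false;
}

// Resolve an expression made of exactly one of the four supported operations.
// `frame_base` stands in for the value read from the saved register context.
std::optional<Location> ResolveSimpleLocation(const std::vector<uint8_t> &expr,
                                              uint64_t frame_base) {
  if (expr.empty()) return std::nullopt;
  const uint8_t op = expr[0];
  size_t pos = 1;
  std::optional<Location> loc;
  if (op == DW_OP_fbreg) {
    int64_t offset = 0;
    if (!ReadSLEB128(expr, pos, offset)) return std::nullopt;
    loc = Location{LocationKind::Memory,
                   frame_base + static_cast<uint64_t>(offset)};
  } else if (op >= DW_OP_reg0 && op <= DW_OP_reg31) {
    loc = Location{LocationKind::Register,
                   static_cast<uint64_t>(op - DW_OP_reg0)};
  } else if (op == DW_OP_constu) {
    uint64_t value = 0;
    if (!ReadULEB128(expr, pos, value)) return std::nullopt;
    loc = Location{LocationKind::Constant, value};
  } else if (op == DW_OP_addr) {
    if (expr.size() < 9) return std::nullopt; // assumes 8-byte addresses
    uint64_t addr = 0;
    for (int i = 0; i < 8; ++i)
      addr |= static_cast<uint64_t>(expr[1 + i]) << (8 * i);
    pos = 9;
    loc = Location{LocationKind::Memory, addr};
  }
  assert(pos <= expr.size());
  // Unknown opcode or trailing operations: too complex for this sketch.
  if (!loc || pos != expr.size()) return std::nullopt;
  return loc;
}
```

For example, the two bytes `0x91 0x70` encode `DW_OP_fbreg -16`, so the variable's address is the frame base minus 16. An expression such as `DW_OP_fbreg -16; DW_OP_deref` is rejected here, which corresponds to the cases that would require the JIT-compiled interpreter mentioned above.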

### Unwinders

When the program hits the injected trap instruction, the execution stops
inside the injected UserExpression.

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
       * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
         frame #1: 0x0000000100105028

This part of the program should be transparent to the user. To allow LLDB to
elide the condition checker and the FCBT frame, the Unwinder needs to be
able to identify all of the frames, up to the user's source code frame.

The injected UserExpression already has a valid stack frame, but it doesn't
have any information about its caller, the Trampoline. In order to unwind to
the user's code, LLDB needs symbolic information for the trampoline.
This information is tied to LLDB modules, created using an ObjectFile
representation, the __ObjectFileTrampoline__ in our case.

It will contain several pieces of information, such as the module's name and
description, but most importantly the module __Symbol Table__ that will have
the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`) and a
__Text Section__ that will tell the unwinder the trampoline bounds.

Then, LLDB inserts a __Function Unwinder__ in the module UnwindTable and
creates an __Unwind Plan__ pointing to the BreakpointLocation return address.
This is done taking into consideration that the trampoline will alter the
memory layout by spilling registers to the stack.
Finally, the newly created module is appended to the target image list, which
allows LLDB to move between the injected code and the user code seamlessly.

This is what the backtrace looks like after hitting the injected trap:

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
         frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
         frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
       * frame #2: 0x0000000100000f5b main`main at main.c:7:23

For now, LLDB selects the user frame but the goal would be to mask all the
frames introduced by the Fast Conditional Breakpoint.

A `debug-injected-condition` setting will make it possible to stop at the
FCBT and show all the elided frames.

Regarding unwinding, I am wondering whether we need to do anything really special. It sounds to me that if we try a little bit harder then we could make the trampoline code look very much like a signal handler, and have it be treated as such. Then the only special thing we would need to do is to hide the topmost trampoline code somewhere higher up in the presentation layer.

I am imagining the trampoline code could look something like this (excuse my bad assembly, I haven't written that in a while):

    pushq %rax
    pushq %rbx
    ...
    leaq $SIZE_OF_REGISTER_CONTEXT(%rsp), %r10 # void *registers
    movq %rsp, %r11 # void *args
    subq $SIZE_OF_ARGS, %rsp
    movq %r10, %rdi
    movq %r11, %rsi
    callq __build_args # __build_args(const void *registers, void *args)
    movq %r11, %rdi
    callq __lldb_expr # __lldb_expr(void *args)
    test %al, %al
    jz .Ldone
    trap_opcode:
    int3
    .Ldone:
    addq $SIZE_OF_ARGS, %rsp
    # pop everything, execute displaced instructions and jump back

I think this trampoline is pretty similar to what you're proposing, but there are a couple of subtle differences:

- the args structure is allocated on the stack - I already spoke about that
- the testing of the condition happens inside the trampoline

I think this second item has several advantages. Firstly, this means that when we hit the breakpoint, we only have one extra frame on the stack. So even if we don't do any extra work in the debugger to hide this stuff, we don't clutter the stack too much.

Secondly, this means we can avoid the "disassemble and scan for trap opcode" step, which is kind of a hack -- after all, we generated these instructions, so we should _know_ where the trap opcode is. This way, you can emit a special symbol (the `trap_opcode` label in the example above) that lldb can then search for, and know its location exactly.

And lastly, and this is the most important advantage IMO, we are in full control of the kind of unwind info we generate for the trampoline. We can generate the proper eh_frame info for this trampoline which would correctly describe the locations of the registers of the previous frame, so that lldb would automatically be able to find them and display them properly when you do for instance "register read" with the parent frame selected. Hopefully, all this would take is a couple of well-placed .cfi assembler directives.

Here, I'm imagining we could use the MC layer in llvm to do this, either by feeding it a raw assembler string, or by using its C++ API, whichever is easier. Then we could feed this to the normal jit together with the compiled C++ expression and it would link it all together and load it into memory.

### Instruction Shifter (WIP)

Because some instructions might use operands that are at an offset relative
to the program counter, copying the instructions to a new location might
change their meaning.

LLDB needs to patch each such instruction with the right offset.
This is done using `llvm::MCInst` to detect the instructions that need to be
rewritten.
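
As an illustration of the offset patching, here is a small sketch for the x86_64 RIP-relative case, where the effective target is the address of the next instruction plus a signed 32-bit displacement. It is an assumption about how the fix-up could be computed, not code from the patches, and `RebaseRipRelativeDisp` is a made-up name. The copied instruction must keep pointing at the same absolute target, and the rewrite is only possible if the new displacement still fits in 32 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Compute the displacement a RIP-relative instruction needs after being moved
// from old_addr to new_addr so that it still refers to the same target.
// Returns std::nullopt when the target is out of reach of a 32-bit offset.
std::optional<int32_t> RebaseRipRelativeDisp(uint64_t old_addr,
                                             uint64_t new_addr,
                                             uint8_t insn_len,
                                             int32_t old_disp) {
  // target = address of the next instruction + displacement
  const uint64_t target =
      old_addr + insn_len + static_cast<uint64_t>(static_cast<int64_t>(old_disp));
  // Unsigned subtraction wraps, and the cast reinterprets it as a signed delta.
  const int64_t new_disp = static_cast<int64_t>(target - (new_addr + insn_len));
  if (new_disp < std::numeric_limits<int32_t>::min() ||
      new_disp > std::numeric_limits<int32_t>::max())
    return std::nullopt;
  // The moved instruction must resolve to the original target.
  assert(new_addr + insn_len + static_cast<uint64_t>(new_disp) == target);
  return static_cast<int32_t>(new_disp);
}
```

When the function returns `std::nullopt`, the instruction cannot simply be copied with a patched offset, and the setup would have to either rewrite it into a longer sequence or fall back to a regular breakpoint.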

## Risk Mitigation

The optimization relies heavily on code injection, most of which is
architecture specific. Because of this, overwriting the instructions
can fail depending on the breakpoint location, e.g.:

- If the overwritten instructions contain indirection (branch instructions).
- If the overwritten instructions are a branch target.
- If there are not enough instructions to insert the branch instruction (x86_64)

If the setup process fails to insert the Fast Conditional Breakpoint, it will
fall back to the legacy behavior, and warn the user about what went wrong.
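
The three failure cases above could be checked up front with something like the following sketch. It is only an illustration under stated assumptions: `DecodedInsn`, `PatchResult` and `CanPatch` are hypothetical names, the instructions are assumed to be already decoded (for example from `llvm::MCInst` data), and the branch is assumed to be a 5-byte `jmp rel32`. It also assumes that the first overwritten instruction may itself be a branch target, since a jump landing there would just enter the trampoline; only targets in the middle of the overwritten range are rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical summary of one decoded instruction at the breakpoint site.
struct DecodedInsn {
  uint64_t address;
  uint8_t length;  // encoded size in bytes
  bool is_branch;  // jump, call or return
};

// x86_64 `jmp rel32`: one opcode byte plus a 4-byte displacement.
constexpr unsigned kJmpRel32Size = 5;

enum class PatchResult { Ok, ContainsBranch, IsBranchTarget, NotEnoughBytes };

// `insns` are the consecutive instructions starting at the breakpoint address.
// `branch_targets` holds addresses known to be jumped to from elsewhere.
PatchResult CanPatch(const std::vector<DecodedInsn> &insns,
                     const std::set<uint64_t> &branch_targets) {
  unsigned covered = 0;
  for (size_t i = 0; i < insns.size() && covered < kJmpRel32Size; ++i) {
    const DecodedInsn &insn = insns[i];
    assert(insn.length > 0);
    if (insn.is_branch)
      return PatchResult::ContainsBranch;
    // A jump into the middle of the overwritten bytes would land inside the
    // displacement of our `jmp`, so such locations are rejected.
    if (i > 0 && branch_targets.count(insn.address) != 0)
      return PatchResult::IsBranchTarget;
    covered += insn.length;
  }
  return covered >= kJmpRel32Size ? PatchResult::Ok
                                  : PatchResult::NotEnoughBytes;
}
```

Any result other than `PatchResult::Ok` would map to the fallback described above: keep the regular conditional breakpoint and report the reason to the user.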

Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).

This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.

In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.

One way to mitigate those limitations would be to use code instrumentation
to detect if it's safe to set a Fast Conditional Breakpoint at a certain
location, and suggest that the user move the FCB before or after the location
where it was set originally.

## Prototype Code

I submitted my patches ([1](https://reviews.llvm.org/D66248), [2](https://reviews.llvm.org/D66249),
[3](https://reviews.llvm.org/D66250)) on Phabricator with the prototype.

## Feedback

Before moving forward I'd like to get the community's input. What do you
think about this approach? Any feedback would be greatly appreciated!

Thanks,

As my last suggestion, I would like to ask you to consider testing as you're writing this code. This is a pretty complex machinery you're building, and it would be nice if it was possible to test pieces of it in isolation instead of just the large end-to-end kinds of tests. For example, in the "instruction shifter" machinery, it would be nice to be able to take a single instruction, execute it both in place and in a "shifted" location, and assert that the resulting register contents are identical.

regards,
pavel

jingham · August 15, 2019, 6:15pm

Thanks for your great comments. A few replies...

I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the `$__lldb_arg` building code.

You have no guarantee that only one thread is running this code at any given time. So you would have to put a mutex in the condition to guard the use of this stack allocation. That's not impossible but it means you're changing threading behavior. Calling the system allocator might take a lock but a lot of allocation systems can hand out small allocations without locking, so it might be simpler to just take advantage of that.

Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).

This is a clever idea. It would also mean that you wouldn't have to figure out how to do register saves and restores in code, since debugserver already knows how to do that, and once you are stopped it is probably not much slower to have debugserver do that job than have the trampoline do it. It also has the advantage that you don't need to deal with the problem where the space that you are able to allocate for the trampoline code is too far away from the code you are patching for a simple jump. It would certainly be worth seeing how much faster this makes conditions.

Unless I'm missing something you would still need two traps. One in the main instruction stream and one to stop when the condition is true. But maybe you meant "a single kind of insertion - a trap" not "a single trap instruction" ...

This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.

In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.

One side-benefit we are trying to get out of the instruction shifting approach is not having to stop all threads when inserting breakpoints as often as possible. Since we can inject thread ID tests into the condition as well, doing the instruction shifting would mean you could specify thread-specific breakpoints, and then ONLY the threads that match the thread specification would ever have to be stopped. You could also have negative tests so that you could specify "no stop" threads. So I still think it is worthwhile pursuing the full implementation Ismail outlined in the long run.

Jim

labath · August 15, 2019, 6:55pm

Thanks for your great comments. A few replies...

I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the `$__lldb_arg` building code.

You have no guarantee that only one thread is running this code at any given time. So you would have to put a mutex in the condition to guard the use of this stack allocation. That's not impossible but it means you're changing threading behavior. Calling the system allocator might take a lock but a lot of allocation systems can hand out small allocations without locking, so it might be simpler to just take advantage of that.

I am sorry, but I am confused. I am suggesting we take a slice of the stack from the thread that happened to hit that breakpoint, and use that memory for the __lldb_arg structure for the purpose of evaluating the condition on that very thread. If two threads hit the breakpoint simultaneously, then we just allocate two chunks of memory on their respective stacks. Or am I misunderstanding something about how this structure is supposed to be used?

> Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).
>
> This is a clever idea. It would also mean that you wouldn't have to figure out how to do register saves and restores in code, since debugserver already knows how to do that, and once you are stopped it is probably not much slower to have debugserver do that job than have the trampoline do it. It also has the advantage that you don't need to deal with the problem where the space that you are able to allocate for the trampoline code is too far away from the code you are patching for a simple jump. It would certainly be worth seeing how much faster this makes conditions.

I actually thought we would use the exact same trampoline that would be used for the full solution (so it would do the register saves, restores, etc), and the stub would only help us to avoid trampling over a long sequence of instructions. But other solutions are certainly possible too...

> Unless I'm missing something you would still need two traps. One in the main instruction stream and one to stop when the condition is true. But maybe you meant "a single kind of insertion - a trap" not "a single trap instruction"

I meant "a single trap in the application's instruction stream". The counts of traps in the code that we generate aren't that important, as we can do what we want there. But if we insert just a single trap opcode, then we are guaranteed to overwrite only one instruction, which means the whole "are we overwriting a jump target" discussion becomes moot. OTOH, if we write a full jump code then we can overwrite a *lot* of instructions -- the shortest sequence that can jump anywhere in the address space I can think of is something like `pushq %rax; movabsq $WHATEVER, %rax; jmpq *%rax`. Something as big as that is fairly likely to overwrite a jump target.
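
To put rough numbers on the size argument, here is a minimal sketch that compares the three patch options mentioned in this thread: a one-byte trap, a five-byte `jmp rel32`, and the 13-byte absolute jump sequence. The function names and the `stub_redirects` flag are hypothetical and only serve the illustration; the byte counts are the standard x86_64 encodings of those instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encoded sizes, in bytes, of the x86_64 patch options discussed above. */
enum {
    TRAP_SIZE = 1,             /* int3 (0xCC) */
    JMP_REL32_SIZE = 5,        /* jmp rel32 (0xE9 + 4-byte displacement) */
    ABS_JUMP_SIZE = 1 + 10 + 2 /* pushq %rax; movabsq $imm64, %rax; jmpq *%rax */
};

/* True if a jmp rel32 placed at `from` can reach `to`. The displacement is
   relative to the end of the 5-byte jump instruction. */
bool fits_rel32(uint64_t from, uint64_t to) {
    int64_t disp = (int64_t)(to - (from + JMP_REL32_SIZE));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* Number of bytes of original code that the patch overwrites.
   `stub_redirects` (hypothetical) means the gdb stub changes the pc for us,
   so a single trap is enough. */
int patch_size(uint64_t site, uint64_t trampoline, bool stub_redirects) {
    assert(site != trampoline); /* a site cannot be its own trampoline */
    if (stub_redirects)
        return TRAP_SIZE;
    return fits_rel32(site, trampoline) ? JMP_REL32_SIZE : ABS_JUMP_SIZE;
}
```

With a trampoline allocated more than 2 GiB away from the breakpoint site, `patch_size` returns 13 unless the stub does the redirection, which is the case where overwriting a jump target becomes likely.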

...

> This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.
>
> In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.
>
> One side-benefit we are trying to get out of the instruction shifting approach is not having to stop all threads when inserting breakpoints as often as possible. Since we can inject thread ID tests into the condition as well, doing the instruction shifting would mean you could specify thread-specific breakpoints, and then ONLY the threads that match the thread specification would ever have to be stopped. You could also have negative tests so that you could specify "no stop" threads. So I still think it is worthwhile pursuing the full implementation Ismail outlined in the long run.

No argument there. I am just proposing this as a stepping stone towards the final goal.

Interestingly, this is one of the places where the otherwise annoying Linux ptrace behavior may come in really handy. Since a thread hitting a breakpoint does not automatically stop all other threads in the process (we have to manually stop all of them ourselves), the lldb-server could do the trampoline stuff without any of the other threads in the process noticing anything.

pl

jingham · August 15, 2019, 7:29pm

> Thanks for your great comments. A few replies...
>
> I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the $__lldb_arg building code.
>
> You have no guarantee that only one thread is running this code at any given time. So you would have to put a mutex in the condition to guard the use of this stack allocation. That's not impossible but it means you're changing threading behavior. Calling the system allocator might take a lock but a lot of allocation systems can hand out small allocations without locking, so it might be simpler to just take advantage of that.
>
> I am sorry, but I am confused. I am suggesting we take a slice of the stack from the thread that happened to hit that breakpoint, and use that memory for the __lldb_arg structure for the purpose of evaluating the condition on that very thread. If two threads hit the breakpoint simultaneously, then we just allocate two chunks of memory on their respective stacks. Or am I misunderstanding something about how this structure is supposed to be used?

Right, I missed that you were suggesting this on the stack - somehow I thought you meant in the allocation we made when setting up the condition.

> Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).
>
> This is a clever idea. It would also mean that you wouldn't have to figure out how to do register saves and restores in code, since debugserver already knows how to do that, and once you are stopped it is probably not much slower to have debugserver do that job than have the trampoline do it. It also has the advantage that you don't need to deal with the problem where the space that you are able to allocate for the trampoline code is too far away from the code you are patching for a simple jump. It would certainly be worth seeing how much faster this makes conditions.
>
> I actually thought we would use the exact same trampoline that would be used for the full solution (so it would do the register saves, restores, etc), and the stub would only help us to avoid trampling over a long sequence of instructions. But other solutions are certainly possible too...

Sure, I was thinking that if you were allowing debugserver to intervene, you could just inject something like:

    void handler(void) {
        void *args = arg_builder();   // build the $__lldb_arg structure
        condition_function(args);     // traps internally if the condition is true
        __builtin_debugtrap();        // handler trap: condition was false
    }

and let debugserver do:

1. Save registers
2. Set pc to handler
3. Continue
4. Hit a trap
5. Decide whether this was the condition function trap or the handler trap
6. Restore registers
7. Set pc back to the trap instruction
8. Return control to lldb, or single-step over the instruction and continue
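
The stub-side sequence above can be sketched as a small state machine. This is only an illustration under simplifying assumptions: a single thread, a `pc` that reports the address of the trap itself, and made-up names (`fast_bp`, `thread_regs`, `on_trap`, `stub_action`) that do not correspond to anything in debugserver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register snapshot kept by the stub. */
typedef struct {
    uint64_t pc;
    uint64_t gpr[16];
} thread_regs;

/* Hypothetical bookkeeping for one stub-assisted conditional breakpoint. */
typedef struct {
    uint64_t site_addr;         /* trap in the application's instruction stream */
    uint64_t handler_addr;      /* entry point of the injected handler */
    uint64_t cond_trap_addr;    /* trap inside condition_function (condition true) */
    uint64_t handler_trap_addr; /* trap at the end of handler (condition false) */
    thread_regs saved;          /* registers saved when the site trap was hit */
    bool evaluating;            /* true while the handler is running */
} fast_bp;

typedef enum {
    RESUME_IN_HANDLER,      /* continue the thread inside the handler */
    REPORT_STOP,            /* condition true: return control to lldb */
    STEP_OVER_AND_CONTINUE, /* condition false: step the original instruction */
    NOT_OURS                /* trap does not belong to this breakpoint */
} stub_action;

stub_action on_trap(fast_bp *bp, thread_regs *regs) {
    assert(bp && regs);
    if (!bp->evaluating && regs->pc == bp->site_addr) {
        bp->saved = *regs;            /* save registers */
        regs->pc = bp->handler_addr;  /* set pc to handler */
        bp->evaluating = true;
        return RESUME_IN_HANDLER;     /* continue */
    }
    if (bp->evaluating && (regs->pc == bp->cond_trap_addr ||
                           regs->pc == bp->handler_trap_addr)) {
        /* Decide which of the two injected traps was hit. */
        bool condition_true = regs->pc == bp->cond_trap_addr;
        *regs = bp->saved;            /* restore registers, pc back at the site */
        bp->evaluating = false;
        return condition_true ? REPORT_STOP : STEP_OVER_AND_CONTINUE;
    }
    return NOT_OURS;
}
```

A real stub would need one saved context per thread, since several threads can hit the same site at once; the single `saved` slot here is only for brevity.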

That way you wouldn't have to mess with injecting the trampoline, unwinding, etc. This should be pretty easy to hack up, and see how much faster this was.

> Unless I'm missing something you would still need two traps. One in the main instruction stream and one to stop when the condition is true. But maybe you meant "a single kind of insertion - a trap" not "a single trap instruction"
>
> I meant "a single trap in the application's instruction stream". The counts of traps in the code that we generate aren't that important, as we can do what we want there. But if we insert just a single trap opcode, then we are guaranteed to overwrite only one instruction, which means the whole "are we overwriting a jump target" discussion becomes moot. OTOH, if we write a full jump code then we can overwrite a *lot* of instructions -- the shortest sequence that can jump anywhere in the address space I can think of is something like `pushq %rax; movabsq $WHATEVER, %rax; jmpq *%rax`. Something as big as that is fairly likely to overwrite a jump target.
>
> ...
>
> This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.
>
> In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.
>
> One side-benefit we are trying to get out of the instruction shifting approach is not having to stop all threads when inserting breakpoints as often as possible. Since we can inject thread ID tests into the condition as well, doing the instruction shifting would mean you could specify thread-specific breakpoints, and then ONLY the threads that match the thread specification would ever have to be stopped. You could also have negative tests so that you could specify "no stop" threads. So I still think it is worthwhile pursuing the full implementation Ismail outlined in the long run.
>
> No argument there. I am just proposing this as a stepping stone towards the final goal.
>
> Interestingly, this is one of the places where the otherwise annoying Linux ptrace behavior may come in really handy. Since a thread hitting a breakpoint does not automatically stop all other threads in the process (we have to manually stop all of them ourselves), the lldb-server could do the trampoline stuff without any of the other threads in the process noticing anything.

You can do the same thing with Mach. When one thread stops the others are still running and we also have to stop them. Mach is just a little more convenient in that there's a single "task_suspend" call that keeps any threads from running, so we don't have to go stop them one by one.

Jim

mib · August 15, 2019, 9:03pm

I built Clang (and LLVM) in Release Mode with Debug Info (-O2),
and got these results:

| DWARF Operation      | Occurrences     |
|----------------------|-----------------|
| DW\_OP\_deref        | 1,570           |
| DW\_OP\_const        | 3,791           |
| DW\_OP\_addr         | 9,528           |
| DW\_OP\_lit          | 62,826          |
| DW\_OP\_fbreg        | 205,382         |
| DW\_OP\_piece        | 242,888         |
| DW\_OP\_stack\_value | 992,261         |
| DW\_OP\_breg         | 1,006,070       |
| DW\_OP\_reg          | 5,175,831       |
| **Total**            | **7,700,147**   |

I could technically implement the logic to support DW_OP_reg, DW_OP_breg
and DW_OP_stack_value fairly easily (which still represents 90% of all ops).
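
As a quick check of that share, the sketch below recomputes it from the counts in the table. The counts are copied from the table; the helper names (`total_ops`, `coverage`) are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct op_count {
    const char *name;
    unsigned long count;
};

/* Occurrence counts copied from the table above. */
static const struct op_count OPS[] = {
    {"DW_OP_deref", 1570UL},         {"DW_OP_const", 3791UL},
    {"DW_OP_addr", 9528UL},          {"DW_OP_lit", 62826UL},
    {"DW_OP_fbreg", 205382UL},       {"DW_OP_piece", 242888UL},
    {"DW_OP_stack_value", 992261UL}, {"DW_OP_breg", 1006070UL},
    {"DW_OP_reg", 5175831UL},
};
static const size_t NUM_OPS = sizeof(OPS) / sizeof(OPS[0]);

/* Sum of all occurrences in the table. */
unsigned long total_ops(void) {
    unsigned long total = 0;
    for (size_t i = 0; i < NUM_OPS; ++i)
        total += OPS[i].count;
    return total;
}

/* Fraction of all occurrences covered by the `n` named operations. */
double coverage(const char *const *names, size_t n) {
    assert(names != NULL);
    unsigned long covered = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < NUM_OPS; ++j)
            if (strcmp(names[i], OPS[j].name) == 0)
                covered += OPS[j].count;
    return (double)covered / (double)total_ops();
}
```

Calling `coverage` with `DW_OP_reg`, `DW_OP_breg` and `DW_OP_stack_value` gives roughly 0.93, so the "90%" figure is on the conservative side; adding `DW_OP_piece` would raise it to about 0.96.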

However, DW_OP_piece is a more complex operation since it combines
several other operations, and would require more work.

This would also imply that there will be two DWARF expression interpreters in
LLDB, hence twice as much code to maintain … I’ll try to see if I can
use the existing interpreter for this feature.

Ismail

adrian.prantl · August 16, 2019, 3:27pm

I strongly agree that unless the code can be shared, the JIT-ed DWARF expression interpreter should be kept as simple as possible and aim to support the lion's share of DWARF expressions encountered in a typical program, but making it support 100% is a lot of effort and maintenance burden with very diminishing returns.

-- adrian

Finkel_Hal_J · August 16, 2019, 3:40pm

+1

(and, thanks for the data! I think it would be useful to support the
things that we can easily support, but the more complicated things
should be weighed carefully against the maintenance costs)

-Hal

mib · August 16, 2019, 6:13pm

Hi Pavel,

Thanks for all your feedback.

I’ve been following the discussion closely and find your approach quite interesting.

As Jim explained, I’m also trying to have a conditional breakpoint that is able to stop a specific thread (by name or ID) when the condition expression evaluates to true.

I feel like stacking up options with your approach would imply doing more context switches.
But it’s definitely a better fallback mechanism than the current one. I’ll try to make a prototype to see the performance difference for both approaches.

Hello Ismail, and welcome to LLDB. You have a very interesting (and not entirely trivial) project, and I wish you the best of luck in your work. I think this will be a very useful addition to lldb.

It sounds like you have researched the problem very well, and the overall direction looks good to me. However, I do have some ideas and suggestions about possible tweaks/improvements that I would like to hear your thoughts on. Please find my comments inline.

Hi everyone,
I’m Ismail, a compiler engineer intern at Apple. As a part of my internship,
I'm adding Fast Conditional Breakpoints to LLDB, using code patching.
Currently, the expressions that power conditional breakpoints are lowered
to LLVM IR and LLDB knows how to interpret a subset of it. If that fails,
the debugger JIT-compiles the expression (compiled once, and re-run on each
breakpoint hit). In both cases LLDB must collect all program state used in
the condition and pass it to the expression.
The goal of my internship project is to make conditional breakpoints faster by:
1. Compiling the expression ahead-of-time, when setting the breakpoint and
inject into the inferior memory only once.
2. Re-route the inferior execution flow to run the expression and check whether
it needs to stop, in-process.
This saves the cost of having to do the context switch between debugger and
the inferior program (about 10 times) to compile and evaluate the condition.
This feature is described on the [LLDB Project page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
The goal would be to have it working for most languages and architectures
supported by LLDB, however my original implementation will be for C-based
languages targeting x86_64. It will be extended to AArch64 afterwards.
Note the way my prototype is implemented makes it fully extensible for other
languages and architectures.
## High Level Design
Every time a breakpoint that holds a condition is hit, multiple context
switches are needed in order to compile and evaluate the condition.
First, the breakpoint is hit and the control is given to the debugger.
That's where LLDB wraps the condition expression into a UserExpression that
will get compiled and injected into the program memory. Another round-trip
between the inferior and the LLDB is needed to run the compiled expression
and extract the expression results that will tell LLDB to stop or not.
To get rid of those context switches, we will evaluate the condition inside
the program, and only stop when the condition is true. LLDB will achieve this
by inserting a jump from the breakpoint address to a code section that will
be allocated into the program memory. It will save the thread state, run the
condition expression, restore the thread state and then execute the copied
instruction(s) before jumping back to the regular program flow.
Then we only trap and return control to LLDB when the condition is true.
## Implementation Details
To be able to evaluate a breakpoint condition without interacting with the
debugger, LLDB changes the inferior program execution flow by overwriting
the instruction at which the breakpoint was set with a branching instruction.
The original instruction(s) are copied to a memory stub allocated in the
inferior program memory called the __Fast Conditional Breakpoint Trampoline__
or __FCBT__. The FCBT will allow us to re-route the program execution flow to
check the condition in-process while preserving the original program behavior.
This part is critical to set up Fast Conditional Breakpoints.

          Inferior Binary                                     Trampoline

    |            .            |                      +-------------------------+
    |            .            |                      |                         |
    |            .            |           +--------->+   Save RegisterContext  |
    |            .            |           |          |                         |
    +-------------------------+           |          +-------------------------+
    |                         |           |          |                         |
    |       Instruction       |           |          |  Build Arguments Struct |
    |                         |           |          |                         |
    +-------------------------+           |          +-------------------------+
    |                         +-----------+          |                         |
    |   Branch to Trampoline  |                      |  Call Condition Checker |
    |                         +<----------+          |                         |
    +-------------------------+           |          +-------------------------+
    |                         |           |          |                         |
    |       Instruction       |           |          | Restore RegisterContext |
    |                         |           |          |                         |
    +-------------------------+           |          +-------------------------+
    |            .            |           |          |                         |
    |            .            |           +----------+ Run Copied Instructions |
    |            .            |                      |                         |
    |            .            |                      +-------------------------+

Once the execution reaches the Trampoline, several steps need to be taken.
LLDB relies on its UserExpressions to JIT these more complex conditional
expressions. However, since the execution will be handled by the debugged
program, LLDB will generate some code ahead-of-time in the Trampoline that
will allow the inferior to initialize the expression's argument structure.

Generating the condition checker as well as the code to initialize
the argument structure on each breakpoint hit is handled by the
__BreakpointInjectedSite__ class, which builds the conditional expression for
all the BreakpointLocations, emits the `$__lldb_expr` function, and relocates
variables in the `$__lldb_arg` structure.
BreakpointInjectedSites are created in the __Process__ if the user enables
the `-I | --inject-condition` flag when setting or modifying a breakpoint.
Because the __FCBT__ is architecture specific, BreakpointInjectedSites will
only be available when a target has added support to it, in the matching
Architecture Plugin.
Several parts of lldb have to be modified to implement this feature:
- **Breakpoint**: Added BreakpointInjectedSite, and helper functions to the
  related classes (Breakpoint, BreakpointLocation, BreakpointSite,
  BreakpointOptions)
- **Plugins**:
  - Added ObjectFileTrampoline for the unwinding
  - Added x86_64 ABI support (FCBT setup & safety checks)
- **Symbol**: Changed `FuncUnwinders` and `UnwindPlan` to support FCBT
- **Target**:
  - Added BreakpointInjectedSite creation to `Process` to insert the jump to
    the FCBT
  - Added the Trampoline module creation to `ABI` for the unwinding
### Breakpoint Option
Since Fast Conditional Breakpoints are still under development, they will not
be on by default, but rather we will provide a flag to "breakpoint set" and
"breakpoint modify" to enable the feature. Note that the end-goal is to have
them as a default and only fall back to regular conditional breakpoints on
unsupported architectures.
They can be enabled with the `-I | --inject-condition` option. They
can also be enabled using the Python Scripting Bridge public API, using the
`InjectCondition(bool enable)` method on an __SBBreakpoint__ or
__SBBreakpointLocation__ object.
This feature is intended to be used with condition expressions
(`-c <expr> | --condition <expr>`), but also with other condition types such as:
- Thread ID (`-t <thread-id> | --thread-id <thread-id>`)
- Thread Index (`-x <thread-index> | --thread-index <thread-index>`)
- Thread Queue Name
### Trampoline
To be able to inject the condition, we need to re-route the debugged program's
execution flow. This part is handled in the __Trampoline__, a memory stub
allocated in the inferior that will contain the condition check while
preserving the program's original behavior.
The trampoline is architecture specific and built by lldb. To have the
condition evaluation work out-of-place, several steps need to be completed:
1. Save all the registers by pushing them to the stack
2. Build the `$__lldb_arg` structure by calling an injected UtilityFunction
3. Check the condition by calling the injected UserExpression and execute a
   trap if the condition is true.
4. Restore the register context
5. Rewrite the operands of the copied original instructions and run them
All the values needed for the steps can be computed ahead of time, when the
breakpoint is set (i.e: size of the allocation, jump address, relocation ...).
Since the x86_64 ISA has variable instruction size, LLDB moves enough
instructions into the trampoline to be able to overwrite them with a jump to the
trampoline. Also, the allocation region for the trampoline might be too far
away for a single jump, so we might need to have several branch islands before
reaching the trampoline (WIP).
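
A minimal sketch of the "move enough instructions" step, assuming the instruction lengths at the breakpoint address have already been obtained from a disassembler. The function name and signature are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many whole instructions, starting at the breakpoint address,
   must be copied into the trampoline so that `patch_len` bytes can safely be
   overwritten. `insn_len[i]` is the encoded length of the i-th instruction.
   `*covered_bytes` receives the total size of the instructions examined.
   Returns 0 if the listed instructions do not cover the patch. */
size_t instructions_to_relocate(const unsigned char *insn_len, size_t count,
                                size_t patch_len, size_t *covered_bytes) {
    assert(patch_len > 0 && covered_bytes != NULL);
    size_t bytes = 0;
    for (size_t i = 0; i < count; ++i) {
        bytes += insn_len[i];
        if (bytes >= patch_len) {
            *covered_bytes = bytes;
            return i + 1;
        }
    }
    *covered_bytes = bytes;
    return 0;
}
```

For a 5-byte jump over instructions of lengths 1, 3, 4 and 2 bytes, three instructions (8 bytes) have to be moved; the 3 bytes left over after the jump would typically be padded, and none of the moved instructions may be a jump target.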
### BreakpointInjectedSite
To handle the Fast Conditional Breakpoint setup, LLDB uses
__BreakpointInjectedSites__, a sub-class of the BreakpointSite class.
BreakpointInjectedSites use several `UserExpression`s to resolve variables
and inject the condition checker.
#### Condition Checker
Because a BreakpointSite can have multiple BreakpointLocations with different
conditions, LLDB first needs to iterate over each owner of the BreakpointSite and
gather all the conditions. If one of the BreakpointLocations doesn't have a
condition or the condition is not set to be injected, the
BreakpointInjectedSite will behave as a regular BreakpointSite.
Once all the conditions are fetched, LLDB will create a __UserExpression__
with the injected trap instruction.
When a trap is hit, LLDB uses the __BreakpointSiteList__, a map from a trap
address to a BreakpointSite to identify where to stop. To allow LLDB to catch
the injected trap at runtime, it will disassemble the compiled expression and
scan for the trap address. The injected trap address is then added to LLDB's
__BreakpointSiteList__.
When generated, this is what the condition checker looks like:

    void $__lldb_expr(void *$__lldb_arg)
    {
        /*lldb_BODY_START*/
        if (condition) {
            __builtin_debugtrap();
        }
        /*lldb_BODY_END*/
    }

#### Argument Builder
The conditional expression will often refer to local variables, and the
references to these variables need to be tied to the instances of them in the
current frame.
Usually the expression evaluator invokes the __Materializer__ which fetches
the variables values and fills the `$__lldb_arg` structure. But since we don't
want to switch contexts, LLDB has to resolve used variables by generating code
that will initialize the `$__lldb_arg` pointer, before running the condition
checker.
That's where the __Argument Builder__ comes in.
The argument builder uses a `UtilityFunction` to generate the
`$__lldb_create_args_struct` function. It is called by the Trampoline
before the condition checker, in order to resolve variables used in the
condition expression.
`$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:
1. It takes advantage of the fact that LLDB saved all the registers to the
   stack and maps them into a `register_context` structure.
    ```cpp
    typedef struct {
    // General Purpose Registers
    } register_context;
    ```
2. Using information from the variable resolver, it allocates a memory stub
   that will contain the used variable addresses.
3. Then, it will use the register values and the collected metadata to
   compute the used variable address and write that into the
   newly allocated structure.
4. Finally the allocated structure is returned to the trampoline, which will
   pass it as an argument to the injected condition checker.

I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the $__lldb_arg building code.

Allocating the $__lldb_arg struct in the stack is on my to-do list. This will change in the coming revisions.

Since `$__lldb_create_args_struct` uses the same JIT Engine as the
UserExpression, LLDB will parse, build and insert it in the program memory.
#### Variable Resolver
When creating a Fast Conditional Breakpoint, the __debug info__ tells us
where the used variables are located. Using this information and the saved
register context, we can generate code that will resolve the variables at
runtime (__Step 3 of the Argument Builder__).
LLDB will first get the `DeclMap` from the condition UserExpression and pull a
list of the used variables. While iterating on that list, LLDB extracts each
variable's __DWARF Expression__.
DWARF expressions explain how to reconstruct a variable's values using DWARF
operations.
LLDB needs the register context because local variables are
often at an offset from the __Stack Base Pointer register__ or written across
one or multiple registers. This is why I've only focused on `DW_OP_fbreg`
expressions since I could get the offset of the variable and add it to the
base pointer register to get its address. The variable address, and other
metadata such as its size, its identifier and the DWARF Expression are saved
to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder`
to create the `$__lldb_arg` structure.
Since all the registers are already mapped to a structure, I should
be able to support more __DWARF Operations__ in the future.
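
The `DW_OP_fbreg` case can be sketched as follows. This is an illustration only: the field layout of `register_context`, the `argument_metadata` record and the `build_args` function are assumptions made for the example, and it takes the frame base to be the value of `%rbp`, which is the common case for unoptimized x86_64 code but is not guaranteed by DWARF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the registers the trampoline spilled to the stack. */
typedef struct {
    uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
} register_context;

/* Hypothetical per-variable record collected when the breakpoint is set. */
typedef struct {
    int64_t fbreg_offset; /* operand of DW_OP_fbreg */
    size_t size;          /* size of the variable in bytes */
} argument_metadata;

/* Writes the address of each used variable into `out`, which the caller
   provides (for example from the stack of the thread that hit the site).
   Assumes the frame base is the saved %rbp. */
void build_args(const register_context *regs, const argument_metadata *meta,
                size_t count, void **out) {
    assert(regs && meta && out);
    for (size_t i = 0; i < count; ++i) {
        uint64_t addr = regs->rbp + (uint64_t)meta[i].fbreg_offset;
        out[i] = (void *)(uintptr_t)addr;
    }
}
```

The condition checker would then dereference `out[i]` to read each variable, so no debugger round-trip is needed to materialize the values.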
After collecting some metrics on the __Clang__ binary, built at __-O0__,
the debug info shows that four DWARF operations account for __99%__ of all
occurrences:

| DWARF Operation | Occurrences               |
|-----------------|---------------------------|
| DW\_OP\_fbreg   | 2 114 612                 |
| DW\_OP\_reg     | 820 548                   |
| DW\_OP\_constu  | 267 450                   |
| DW\_OP\_addr    | 17 370                    |
| __Top 4__       | __3 219 980 Occurrences__ |
| __Total__       | __3 236 859 Occurrences__ |

Those 4 operations are the ones that I'll support for now.
To support more complex expressions, we would need to JIT-compile
a DWARF expression interpreter.
### Unwinders
When the program hits the injected trap instruction, the execution stops
inside the injected UserExpression.

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
      * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
        frame #1: 0x0000000100105028

This part of the program should be transparent to the user. To allow LLDB to
elide the condition checker and the FCBT frame, the Unwinder needs to be
able to identify all of the frames, up to the user's source code frame.
The injected UserExpression already has a valid stack frame, but it doesn't
have any information about its caller, the Trampoline. In order to unwind to
the user's code, LLDB needs symbolic information for the trampoline.
This information is tied to LLDB modules, created using an ObjectFile
representation, the __ObjectFileTrampoline__ in our case.
It will contain several pieces of information such as the module's name and
description, but most importantly the module __Symbol Table__ that will have
the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`) and a
__Text Section__ that will tell the unwinder the trampoline bounds.
Then, LLDB inserts a __Function Unwinder__ in the module UnwindTable and
creates an __Unwind Plan__ pointing to the BreakpointLocation return address.
This is done taking into consideration that the trampoline will alter the
memory layout by spilling registers to the stack.
Finally, the newly created module is appended to the target image list, which
allows LLDB to move between the injected code and the user code seamlessly.
This is what the backtrace looks like after hitting the injected trap:
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
    frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
  * frame #2: 0x0000000100000f5b main`main at main.c:7:23
For now, LLDB selects the user frame but the goal would be to mask all the
frames introduced by the Fast Conditional Breakpoint.
A `debug-injected-condition` setting will make it possible to stop at the FCBT
and show all the elided frames.
Regarding unwinding, I am wondering whether we really need to do anything special. It sounds to me that if we try a little bit harder then we could make the trampoline code look very much like a signal handler, and have it be treated as such. Then the only special thing we would need to do is to hide the topmost trampoline code somewhere higher up in the presentation layer.

I am imagining the trampoline code could look something like this (excuse my bad assembly, I haven't written that in a while):
pushq %rax
pushq %rbx
...
leaq $SIZE_OF_REGISTER_CONTEXT(%rsp), %r10 # void *registers
movq %rsp, %r11 # void *args
subq $SIZE_OF_ARGS, %rsp
movq %r10, %rdi
movq %r11, %rsi
callq __build_args # __build_args(const void *registers, void *args)
movq %r11, %rdi
callq __lldb_expr # __lldb_expr(void *args)
test %al, %al
jz .Ldone
trap_opcode:
int3
.Ldone:
addq $SIZE_OF_ARGS, %rsp
# pop everything, execute displaced instructions and jump back

I think this trampoline is pretty similar to what you're proposing, but there are a couple of subtle differences:
- the args structure is allocated on the stack - I already spoke about that
- the testing of the condition happens inside the trampoline
I think this second item has several advantages. Firstly, this means that when we hit the breakpoint, we only have one extra frame on the stack. So even if we don't do any extra work in the debugger to hide this stuff, we don't clutter the stack too much.

Secondly, this means we can avoid the "disassemble and scan for trap opcode" step, which is kind of a hack -- after all, we generated these instructions, so we should _know_ where the trap opcode is. This way, you can emit a special symbol (trap_opcode label in the example above), that lldb can then search for, and know its location exactly.

I think testing the condition inside the trampoline might be very limiting:
- The variable resolution would need to be rethought to allow the condition check to happen in the trampoline.
- To be able to support different condition types (expression / thread name / thread id …), the $__lldb_expr is a better option IMO. In the future, we might also inject logging code that would only be run according to the condition.
- This feature requires at least one more frame (for your approach), that would still need to be hidden from the user. I don’t think hiding 2 frames is more work than hiding 1.

And lastly, and this is the most important advantage IMO, we are in full control of the kind of unwind info we generate for the trampoline. We can generate the proper eh_frame info for this trampoline which would correctly describe the locations of the registers of the previous frame, so that lldb would automatically be able to find them and display them properly when you do for instance "register read" with the parent frame selected. Hopefully, all this would take is a couple of well-placed .cfi assembler instructions.

Here, I'm imagining we could use the MC layer in llvm to do this, either by feeding it a raw assembler string, or by using its C++ API, whichever is easier. Then we could feed this to the normal jit together with the compiled c++ expression and it would link it all together and load it into memory.

### Instruction Shifter (WIP)
Because some instructions might use operands that are at an offset relative
to the program counter, copying the instructions to a new location might
change their meaning, so LLDB needs to patch each of them with the right offset.
This is done using `llvm::MCInst` to detect the instructions
that need to be rewritten.
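For the common case of a 32-bit PC-relative displacement, the patching boils down to one subtraction plus a range check. The sketch below assumes the instruction keeps the same length after being moved and that only its `disp32` field changes. `RebaseDisp32` is a hypothetical helper, not something from the prototype or from LLVM.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Recomputes a PC-relative disp32 so that an instruction moved from
// `old_addr` to `new_addr` still refers to the same target.
// Returns false when the target is no longer reachable with 32 bits.
bool RebaseDisp32(std::int32_t disp, std::uint64_t old_addr,
                  std::uint64_t new_addr, std::int32_t &out) {
  constexpr std::int64_t kMax = std::numeric_limits<std::int32_t>::max();
  constexpr std::int64_t kMin = std::numeric_limits<std::int32_t>::min();

  // Signed distance the instruction moved by (two's complement wrap).
  const std::int64_t delta = static_cast<std::int64_t>(old_addr - new_addr);
  if (delta > kMax - kMin || delta < kMin - kMax)
    return false; // too far for any disp32, and keeps the sum below safe

  const std::int64_t rebased = static_cast<std::int64_t>(disp) + delta;
  if (rebased < kMin || rebased > kMax)
    return false;

  // Sanity check: the absolute target must not change.
  assert(old_addr + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp)) ==
         new_addr + static_cast<std::uint64_t>(rebased));
  out = static_cast<std::int32_t>(rebased);
  return true;
}
```

When this returns false, the instruction cannot simply be copied, which is one more reason for the setup to fall back to a regular breakpoint.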
## Risk Mitigation
The optimization relies heavily on code injection, most of which is
architecture specific. Because of this, overwriting the instructions
can fail depending on the breakpoint location, e.g.:
- If the overwritten instructions contain indirection (branch instructions).
- If the overwritten instructions are a branch target.
- If there are not enough instructions to insert the branch instruction (x86_64).
If the setup process fails to insert the Fast Conditional Breakpoint, it will
fall back to the legacy behavior, and warn the user about what went wrong.
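The three checks above could be combined into a single pass over the instructions at the breakpoint address, roughly as sketched here. `DecodedInst` and `InstructionsToRelocate` are hypothetical, and the jump is assumed to be a 5-byte `jmp rel32`. Treating only instructions after the first as problematic branch targets is this sketch's interpretation, since a jump to the breakpoint address itself just lands on the injected branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal facts a disassembler would provide about each instruction.
struct DecodedInst {
  std::size_t length;    // encoded size in bytes
  bool is_branch;        // transfers control (jmp, jcc, call, ret)
  bool is_branch_target; // some other instruction jumps here
};

constexpr std::size_t kJmpRel32Size = 5; // x86_64 `jmp rel32`

// Returns how many leading instructions must move to the trampoline to make
// room for the jump, or 0 if the location is unsafe and the breakpoint
// should fall back to the legacy behavior.
std::size_t InstructionsToRelocate(const std::vector<DecodedInst> &insts) {
  std::size_t covered = 0;
  for (std::size_t i = 0; i < insts.size(); ++i) {
    assert(insts[i].length > 0);
    if (insts[i].is_branch)
      return 0; // indirection inside the overwritten range
    if (i > 0 && insts[i].is_branch_target)
      return 0; // a jump would land in the middle of the injected branch
    covered += insts[i].length;
    if (covered >= kJmpRel32Size)
      return i + 1;
  }
  return 0; // not enough bytes to hold the jump
}
```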

Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).

This would be kind of similar to the "cond_list" in the gdb-remote "Z0;addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.

In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.
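On the stub side, this fallback amounts to a lookup table consulted before a stop is reported. The following is a rough model only: `RedirectTable` and its methods are hypothetical, no such feature exists in any stub as far as this thread goes, and details such as pc adjustment after the trap are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Maps "special" breakpoint addresses to the trampoline that should run
// instead of reporting a stop to the debugger.
class RedirectTable {
public:
  void Add(std::uint64_t breakpoint_addr, std::uint64_t trampoline_addr) {
    assert(trampoline_addr != 0);
    redirects_[breakpoint_addr] = trampoline_addr;
  }

  // Called when a trap is hit. Returns true and rewrites `pc` when the stub
  // can silently resume the thread in the trampoline. Returns false when
  // the stop must be reported as usual.
  bool HandleTrap(std::uint64_t breakpoint_addr, std::uint64_t &pc) const {
    auto it = redirects_.find(breakpoint_addr);
    if (it == redirects_.end())
      return false;
    pc = it->second;
    return true;
  }

private:
  std::unordered_map<std::uint64_t, std::uint64_t> redirects_;
};
```

Only the single trap instruction is written into the program in this scheme, so none of the instruction shifting constraints apply.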

One way to mitigate those limitations would be to use code instrumentation
to detect if it's safe to set a Fast Conditional Breakpoint at a certain
location, and hint the user to move the FCB before or after the location where
it was set originally.
## Prototype Code
I submitted my patches ([1](https://reviews.llvm.org/D66248), [2](https://reviews.llvm.org/D66249),
[3](https://reviews.llvm.org/D66250)) on Phabricator with the prototype.
## Feedback
Before moving forward I'd like to get the community's input. What do you
think about this approach? Any feedback would be greatly appreciated!
Thanks,

As my last suggestion, I would like to ask you to consider testing as you're writing this code. This is a pretty complex machinery you're building, and it would be nice if it was possible to test pieces of it in isolation instead of just the large end-to-end kinds of tests. For example, in the "instruction shifter" machinery, it would be nice to be able to take a single instruction, execute both in place, and in a "shifted" location, and assert that the resulting register contents are identical.

Will do.

regards,
pavel

Thanks,

Ismail.

Frederic_Riss · August 19, 2019, 9:30pm

I think testing the condition inside the trampoline might be very limiting:
- The variable resolution would need to be rethought to allow the condition check to happen in the trampoline.
- To be able to support different condition types (expression / thread name / thread id …), the $__lldb_expr is a better option IMO. In the future, we might also inject logging code that would only be run according to the condition.
- This feature requires at least one more frame (for your approach), that would still need to be hidden from the user. I don’t think hiding 2 frames is more work than hiding 1.

I might be the one misunderstanding, but I think you missed Pavel’s point. In Pavel’s model, you still JIT the condition into __lldb_expr and pass it the argument structure. The difference is that you don’t have the trap inside of the JITed code, you have the JITed code return whether to stop or not and have the trampoline hit the trap depending on the return value. I agree this seems cleaner than scanning the output to find the trap.
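To spell out that control flow, here is a small model in which plain functions stand in for the JITed pieces. All names are hypothetical, the example condition is `i == 3`, and `i` is assumed to live at `rbp - 8`. The real trampoline would be assembly and would execute the trap itself instead of returning a flag.

```cpp
#include <cassert>
#include <cstdint>

struct RegisterContext {
  std::uint64_t gpr[16]; // indexed by DWARF register number, 6 = rbp
};

// Argument structure for the example condition: the address of `i`.
struct ConditionArgs {
  const std::int32_t *i;
};

using ArgBuilder = void (*)(const RegisterContext &, ConditionArgs &);
using ConditionChecker = bool (*)(const ConditionArgs &);

// Models the trampoline body: the args live on the trampoline's stack, the
// checker only returns a verdict, and the trampoline owns the trap.
bool TrampolineShouldTrap(const RegisterContext &regs, ArgBuilder build_args,
                          ConditionChecker check) {
  assert(build_args && check);
  ConditionArgs args{};
  build_args(regs, args);
  return check(args); // true: the real trampoline would execute `int3` here
}

// Stand-in for the generated argument builder.
void BuildArgs(const RegisterContext &regs, ConditionArgs &args) {
  args.i = reinterpret_cast<const std::int32_t *>(
      static_cast<std::uintptr_t>(regs.gpr[6] - 8));
}

// Stand-in for the JITed condition, returning instead of trapping.
bool CheckCondition(const ConditionArgs &args) { return *args.i == 3; }
```

Because the trap sits at a fixed place in the trampoline, its address is known when the trampoline is generated and nothing has to be scanned afterwards.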

Fred

mib · August 19, 2019, 10:11pm

Hi Pavel,

Thanks for all your feedbacks.

I’ve been following the discussion closely and find your approach quite interesting.

As Jim explained, I’m also trying to have a conditional breakpoint, that is able to stop a specific thread (name or id) when the condition expression evaluates to true.

I feel like stacking up options with your approach would imply doing more context switches.
But it’s definitely a better fallback mechanism than the current one. I’ll try to make a prototype to see the performance difference for both approaches.
Hello Ismail, and wellcome to LLDB. You have a very interesting (and not entirely trivial) project, and I wish you the best of luck in your work. I think this will be a very useful addition to lldb.

It sounds like you have researched the problem very well, and the overall direction looks good to me. However, I do have some ideas suggestions about possible tweaks/improvements that I would like to hear your thoughts on. Please find my comments inline.
Hi everyone,
I’m Ismail, a compiler engineer intern at Apple. As a part of my internship,
I'm adding Fast Conditional Breakpoints to LLDB, using code patching.
Currently, the expressions that power conditional breakpoints are lowered
to LLVM IR and LLDB knows how to interpret a subset of it. If that fails,
the debugger JIT-compiles the expression (compiled once, and re-run on each
breakpoint hit). In both cases LLDB must collect all program state used in
the condition and pass it to the expression.
The goal of my internship project is to make conditional breakpoints faster by:
1. Compiling the expression ahead-of-time, when setting the breakpoint and
inject into the inferior memory only once.
2. Re-route the inferior execution flow to run the expression and check whether
it needs to stop, in-process.
This saves the cost of having to do the context switch between debugger and
the inferior program (about 10 times) to compile and evaluate the condition.
This feature is described on the [LLDB Project page](https://lldb.llvm.org/status/projects.html#use-the-jit-to-speed-up-conditional-breakpoint-evaluation).
The goal would be to have it working for most languages and architectures
supported by LLDB, however my original implementation will be for C-based
languages targeting x86_64. It will be extended to AArch64 afterwards.
Note the way my prototype is implemented makes it fully extensible for other
languages and architectures.
## High Level Design
Every time a breakpoint that holds a condition is hit, multiple context
switches are needed in order to compile and evaluate the condition.
First, the breakpoint is hit and the control is given to the debugger.
That's where LLDB wraps the condition expression into a UserExpression that
will get compiled and injected into the program memory. Another round-trip
between the inferior and the LLDB is needed to run the compiled expression
and extract the expression results that will tell LLDB to stop or not.
To get rid of those context switches, we will evaluate the condition inside
the program, and only stop when the condition is true. LLDB will achieve this
by inserting a jump from the breakpoint address to a code section that will
be allocated into the program memory. It will save the thread state, run the
condition expression, restore the thread state and then execute the copied
instruction(s) before jumping back to the regular program flow.
Then we only trap and return control to LLDB when the condition is true.
## Implementation Details
To be able to evaluate a breakpoint condition without interacting with the
debugger, LLDB changes the inferior program execution flow by overwriting
the instruction at which the breakpoint was set with a branching instruction.
The original instruction(s) are copied to a memory stub allocated in the
inferior program memory called the __Fast Conditional Breakpoint Trampoline__
or __FCBT__. The FCBT will allow us the re-route the program execution flow to
check the condition in-process while preserving the original program behavior.
This part is critical to setup Fast Conditional Breakpoints.
     Inferior Binary                                     Trampoline
>            .            |                      +-------------------------+
>            .            |                      |                         |
>            .            |           +--------->+   Save RegisterContext  |
>            .            |           |          |                         |
+-------------------------+           |          +-------------------------+
>                         >           >          >                         >
>       Instruction       |           |          |  Build Arguments Struct |
>                         >           >          >                         >
+-------------------------+           |          +-------------------------+
>                         +-----------+          |                         |
>   Branch to Trampoline  |                      |  Call Condition Checker |
>                         +<----------+          |                         |
+-------------------------+           |          +-------------------------+
>                         >           >          >                         >
>       Instruction       |           |          | Restore RegisterContext |
>                         >           >          >                         >
+-------------------------+           |          +-------------------------+
>            .            |           |          |                         |
>            .            |           +----------+ Run Copied Instructions |
>            .            |                      |                         |
>            .            |                      +-------------------------+

Once the execution reaches the Trampoline, several steps need to be taken.

LLDB relies on its UserExpressions to JIT these more complex conditional
expressions. However, since the execution will be handled by the debugged
program, LLDB will generate some code ahead-of-time in the Trampoline that
will allow the inferior to initialize the expression's argument structure.

Generating the condition checker, as well as the code that initializes
the argument structure at each breakpoint hit, is handled by the
__BreakpointInjectedSite__ class, which builds the conditional expression for
all the BreakpointLocations, emits the `$__lldb_expr` function, and relocates
variables in the `$__lldb_arg` structure.

BreakpointInjectedSites are created in the __Process__ if the user enables
the `-I | --inject-condition` flag when setting or modifying a breakpoint.
Because the __FCBT__ is architecture-specific, BreakpointInjectedSites will
only be available when a target has added support for it in the matching
Architecture Plugin.

Several parts of LLDB have to be modified to implement this feature:

- **Breakpoint**: Added BreakpointInjectedSite, and helper functions to the
  related classes (Breakpoint, BreakpointLocation, BreakpointSite,
  BreakpointOptions).
- **Plugins**: Added ObjectFileTrampoline for the unwinding, and x86_64 ABI
  support (FCBT setup and safety checks).
- **Symbol**: Changed `FuncUnwinders` and `UnwindPlan` to support the FCBT.
- **Target**: Added BreakpointInjectedSite creation to `Process` to insert
  the jump to the FCBT, and the Trampoline module creation to `ABI` for the
  unwinding.

### Breakpoint Option

Since Fast Conditional Breakpoints are still under development, they will not
be on by default. Instead, we will provide a flag to "breakpoint set" and
"breakpoint modify" to enable the feature. Note that the end goal is to make
them the default and only fall back to regular conditional breakpoints on
unsupported architectures.

They can be enabled with the `-I | --inject-condition` option. The option
can also be enabled through the Python Scripting Bridge public API, using the
`InjectCondition(bool enable)` method on an __SBBreakpoint__ or
__SBBreakpointLocation__ object.

This feature is intended to be used with condition expressions
(`-c <expr> | --condition <expr>`), but also with other condition types such as:

- Thread ID (`-t <thread-id> | --thread-id <thread-id>`)
- Thread Index (`-x <thread-index> | --thread-index <thread-index>`)
- Thread Queue Name

### Trampoline

To be able to inject the condition, we need to re-route the debugged program's
execution flow. This part is handled in the __Trampoline__, a memory stub
allocated in the inferior that will contain the condition check while
preserving the program's original behavior.

The trampoline is architecture-specific and built by LLDB. To have the
condition evaluation work out-of-place, several steps need to be completed:

1. Save all the registers by pushing them to the stack.
2. Build the `$__lldb_arg` structure by calling an injected UtilityFunction.
3. Check the condition by calling the injected UserExpression and execute a
   trap if the condition is true.
4. Restore the register context.
5. Rewrite the operands of the copied original instructions and run them.

All the values needed for these steps can be computed ahead of time, when the
breakpoint is set (e.g., size of the allocation, jump address, relocations).

Since the x86_64 ISA has variable instruction sizes, LLDB moves enough
instructions into the trampoline to be able to overwrite them with a jump to the
trampoline. Also, the allocation region for the trampoline might be too far
away for a single jump, so we might need several branch islands before
reaching the trampoline (WIP).
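
To make the last paragraph concrete, here is a minimal Python sketch of the two ahead-of-time computations it implies. It is not taken from the prototype: the function names are hypothetical, and it assumes the patch is a 5-byte x86_64 `jmp rel32` (opcode `0xE9` plus a signed 32-bit displacement measured from the end of the jump).

```python
JMP_REL32_SIZE = 5  # opcode 0xE9 followed by a 32-bit signed displacement


def bytes_to_relocate(instruction_sizes, patch_size=JMP_REL32_SIZE):
    """Return how many whole-instruction bytes must move to the trampoline.

    instruction_sizes lists the byte lengths of the instructions starting at
    the breakpoint address, in program order.
    """
    total = 0
    for size in instruction_sizes:
        if total >= patch_size:
            break
        total += size
    if total < patch_size:
        raise ValueError("not enough instruction bytes to hold the jump")
    return total


def jmp_rel32_displacement(patch_addr, trampoline_addr):
    """Displacement for `jmp rel32`, or None if a branch island is needed."""
    # The displacement is relative to the end of the jump instruction.
    disp = trampoline_addr - (patch_addr + JMP_REL32_SIZE)
    if -(1 << 31) <= disp < (1 << 31):
        return disp
    return None
```

With instruction lengths `[1, 2, 2, 3]`, the first three instructions (5 bytes) would be copied. A `None` result from the second function corresponds to the "too far away for a single jump" case.
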

### BreakpointInjectedSite

To handle the Fast Conditional Breakpoint setup, LLDB uses the
__BreakpointInjectedSite__ class, a subclass of BreakpointSite.
BreakpointInjectedSites use several `UserExpression` instances to resolve
variables and inject the condition checker.

#### Condition Checker

Because a BreakpointSite can have multiple BreakpointLocations with different
conditions, LLDB first needs to iterate over each owner of the BreakpointSite and
gather all the conditions. If one of the BreakpointLocations doesn't have a
condition, or its condition is not set to be injected, the
BreakpointInjectedSite will behave as a regular BreakpointSite.

Once all the conditions are fetched, LLDB will create a __UserExpression__
with the injected trap instruction.

When a trap is hit, LLDB uses the __BreakpointSiteList__, a map from a trap
address to a BreakpointSite, to identify where to stop. To allow LLDB to catch
the injected trap at runtime, it will disassemble the compiled expression and
scan for the trap address. The injected trap address is then added to LLDB's
__BreakpointSiteList__.

When generated, this is what the condition checker looks like:

    void $__lldb_expr(void *$__lldb_arg)
    {
        /*lldb_BODY_START*/
        if (condition) {
            __builtin_debugtrap();
        }
        /*lldb_BODY_END*/
    }
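
The fallback rule and the trap lookup described in this section can be sketched in Python as follows. This is only an illustration, not LLDB code: `Location` is a hypothetical stand-in for a BreakpointLocation, the disassembly is assumed to be available as `(address, mnemonic)` pairs, and `int3` is assumed to be the instruction `__builtin_debugtrap()` lowers to on x86_64.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    # Hypothetical stand-in for a BreakpointLocation.
    condition: Optional[str]
    inject: bool


def injectable_conditions(locations):
    """Return every condition, or None to fall back to a regular site."""
    conditions = []
    for loc in locations:
        if not loc.condition or not loc.inject:
            return None
        conditions.append(loc.condition)
    return conditions


def find_trap_address(instructions, trap_mnemonic="int3"):
    """Scan (address, mnemonic) pairs from a disassembler for the trap."""
    for address, mnemonic in instructions:
        if mnemonic == trap_mnemonic:
            return address
    return None
```

The address returned by `find_trap_address` is what would be used as the key when registering the site in a trap-address map such as the BreakpointSiteList.
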

#### Argument Builder

The conditional expression will often refer to local variables, and the
references to these variables need to be tied to the instances of them in the
current frame.

Usually the expression evaluator invokes the __Materializer__, which fetches
the variables' values and fills the `$__lldb_arg` structure. But since we don't
want to switch contexts, LLDB has to resolve the used variables by generating code
that will initialize the `$__lldb_arg` pointer before running the condition
checker.

That's where the __Argument Builder__ comes in.

The argument builder uses a `UtilityFunction` to generate the
`$__lldb_create_args_struct` function. It is called by the Trampoline
before the condition checker, in order to resolve the variables used in the
condition expression.

`$__lldb_create_args_struct` will fill the `$__lldb_arg` in several steps:

1. It takes advantage of the fact that LLDB saved all the registers to the
   stack and maps them to a `register_context` structure.

       typedef struct {
           // General Purpose Registers
       } register_context;

2. Using information from the variable resolver, it allocates a memory stub
   that will contain the used variable addresses.
3. Then, it will use the register values and the collected metadata to
   compute the used variable addresses and write them into the
   newly allocated structure.
4. Finally, the allocated structure is returned to the trampoline, which will
   pass it as an argument to the injected condition checker.
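
Steps 2 to 4 can be modeled with a short Python sketch. It is a simplification under stated assumptions, not the generated `$__lldb_create_args_struct` code: the register context is a plain dictionary, each metadata entry is a hypothetical `(base_register, offset)` pair, and the structure is taken to be a packed array of 64-bit little-endian addresses.

```python
import struct


def build_args_struct(register_context, metadata):
    """Pack one 64-bit variable address per entry, in metadata order.

    register_context maps register names to saved values; each metadata
    entry is a hypothetical (base_register, offset) pair.
    """
    addresses = []
    for base_register, offset in metadata:
        address = (register_context[base_register] + offset) & 0xFFFFFFFFFFFFFFFF
        addresses.append(address)
    # "<Q" = little-endian unsigned 64-bit, matching x86_64 pointers.
    return struct.pack("<%dQ" % len(addresses), *addresses)
```

For example, with `rbp = 0x7FFEE000` and two variables at offsets `-8` and `-16`, the resulting 16-byte buffer holds the addresses `0x7FFEDFF8` and `0x7FFEDFF0`.
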

I am wondering whether we really need to involve the memory allocation functions here. What's the size of this address structure? I would expect it to be relatively small compared to the size of the entire register context that we have just saved to the stack. If that's the case, then maybe we could have the trampoline allocate some space on the stack and pass that as an argument to the `$__lldb_arg` building code.

Allocating the `$__lldb_arg` struct on the stack is on my to-do list. This will change in the coming revisions.

Since `$__lldb_create_args_struct` uses the same JIT engine as the
UserExpression, LLDB will parse, build and insert it into the program memory.

#### Variable Resolver

When creating a Fast Conditional Breakpoint, the __debug info__ tells us
where the used variables are located. Using this information and the saved
register context, we can generate code that will resolve the variables at
runtime (__Step 3 of the Argument Builder__).

LLDB will first get the `DeclMap` from the condition UserExpression and pull a
list of the used variables. While iterating over that list, LLDB extracts each
variable's __DWARF Expression__.
DWARF expressions explain how to reconstruct a variable's value using DWARF
operations.

LLDB needs the register context because local variables are
often at an offset from the __Stack Base Pointer register__ or spread across
one or multiple registers. This is why I've only focused on `DW_OP_fbreg`
expressions so far: I can get the offset of the variable and add it to the
base pointer register to get its address. The variable address, and other
metadata such as its size, its identifier and the DWARF Expression, are saved
to an `ArgumentMetadata` vector that will be used by the `ArgumentBuilder`
to create the `$__lldb_arg` structure.

Since all the registers are already mapped to a structure, I should
be able to support more __DWARF Operations__ in the future.
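
As an illustration of the `DW_OP_fbreg` case, here is a small Python sketch that decodes the operation and computes the variable address. It only relies on the DWARF encoding (opcode `0x91` followed by one SLEB128 offset). The function names are hypothetical, and the frame base is simply passed in, on the assumption that it is the saved base pointer value.

```python
DW_OP_FBREG = 0x91  # DWARF opcode, followed by one SLEB128 offset


def decode_sleb128(data, pos=0):
    """Decode a signed LEB128 value; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if byte & 0x40:  # sign bit of the last byte is set
        result -= 1 << shift
    return result, pos


def resolve_fbreg(expression, frame_base):
    """Return the variable address for a single-operation DW_OP_fbreg."""
    if not expression or expression[0] != DW_OP_FBREG:
        raise ValueError("only DW_OP_fbreg is handled in this sketch")
    offset, end = decode_sleb128(expression, 1)
    if end != len(expression):
        raise ValueError("unexpected trailing operations")
    return frame_base + offset
```

For instance, the expression bytes `91 78` encode `DW_OP_fbreg -8`, so with a frame base of `0x7FFEE000` the variable lives at `0x7FFEDFF8`.
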

After collecting some metrics on the __Clang__ binary, built at __-O0__,
the debug info shows that four DWARF operations account for more than __99%__
of all occurrences:

| DWARF Operation | Occurrences   |
|-----------------|---------------|
| DW_OP_fbreg     | 2 114 612     |
| DW_OP_reg       | 820 548       |
| DW_OP_constu    | 267 450       |
| DW_OP_addr      | 17 370        |
| __Top 4__       | __3 219 980__ |
| __Total__       | __3 236 859__ |

Those 4 operations are the ones that I'll support for now.
To support more complex expressions, we would need to JIT-compile
a DWARF expression interpreter.
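
The coverage figure can be checked directly from the counts in the table. This is a quick arithmetic check in Python and uses nothing beyond the numbers reported above.

```python
occurrences = {
    "DW_OP_fbreg": 2_114_612,
    "DW_OP_reg": 820_548,
    "DW_OP_constu": 267_450,
    "DW_OP_addr": 17_370,
}
total = 3_236_859

top4 = sum(occurrences.values())
coverage = top4 / total
print(top4, f"{coverage:.2%}")  # prints: 3219980 99.48%
```
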

### Unwinders

When the program hits the injected trap instruction, the execution stops
inside the injected UserExpression.

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
      * frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f5671000) at lldb-33192c.expr:49
        frame #1: 0x0000000100105028

This part of the program should be transparent to the user. To allow LLDB to
elide the condition checker and the FCBT frame, the Unwinder needs to be
able to identify all of the frames, up to the user's source code frame.

The injected UserExpression already has a valid stack frame, but it doesn't
have any information about its caller, the Trampoline. In order to unwind to
the user's code, LLDB needs symbolic information for the trampoline.

This information is tied to LLDB modules, created using an ObjectFile
representation, the __ObjectFileTrampoline__ in our case.
It will contain several pieces of information, such as the module's name and
description, but most importantly the module's __Symbol Table__, which will have
the trampoline symbol (`$__lldb_injected_conditional_bp_trampoline`), and a
__Text Section__ that will tell the unwinder the trampoline bounds.

Then, LLDB inserts a __Function Unwinder__ in the module's UnwindTable and
creates an __Unwind Plan__ pointing to the BreakpointLocation return address.
This is done taking into consideration that the trampoline will alter the
memory layout by spilling registers to the stack.

Finally, the newly created module is appended to the target image list, which
allows LLDB to move between the injected code and the user code seamlessly.

This is what the backtrace looks like after hitting the injected trap:

    * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
        frame #0: 0x00000001001070b9 $__lldb_expr`$__lldb_expr($__lldb_arg=0x00000001f4c71000) at lldb-ca98b7.expr:49
        frame #1: 0x0000000100105028 $__lldb_injected_conditional_bp_trampoline`$__lldb_injected_conditional_bp_trampoline + 40
      * frame #2: 0x0000000100000f5b main`main at main.c:7:23

For now, LLDB selects the user frame, but the goal would be to mask all the
frames introduced by the Fast Conditional Breakpoint.

A `debug-injected-condition` setting will make it possible to stop at the FCBT
and show all the elided frames.
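
The frame selection described here could be modeled as below. This is a hypothetical Python sketch, not the unwinder code: frames are reduced to their module names, and the `show_injected` parameter stands in for the `debug-injected-condition` setting.

```python
INJECTED_MODULES = {
    "$__lldb_expr",
    "$__lldb_injected_conditional_bp_trampoline",
}


def first_user_frame(frames, show_injected=False):
    """Return the index of the frame to select after an injected trap.

    frames is a list of module names, innermost frame first.
    """
    if show_injected:
        return 0
    for index, module in enumerate(frames):
        if module not in INJECTED_MODULES:
            return index
    return 0
```

Applied to the backtrace above, the function returns index 2, the `main` frame.
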

Regarding unwinding, I am wondering whether we really need to do anything special. It sounds to me that if we try a little bit harder, then we could make the trampoline code look very much like a signal handler, and have it be treated as such. Then the only special thing we would need to do is to hide the topmost trampoline code somewhere higher up in the presentation layer.

I am imagining the trampoline code could look something like this (excuse my bad assembly, I haven't written that in a while):

        pushq %rax
        pushq %rbx
        ...
        leaq SIZE_OF_REGISTER_CONTEXT(%rsp), %r12   # void *registers
        subq $SIZE_OF_ARGS, %rsp
        movq %rsp, %r13                             # void *args (r12/r13 are callee-saved)
        movq %r12, %rdi
        movq %r13, %rsi
        callq __build_args   # __build_args(const void *registers, void *args)
        movq %r13, %rdi
        callq __lldb_expr    # __lldb_expr(void *args)
        testb %al, %al
        jz .Ldone
    trap_opcode:
        int3
    .Ldone:
        addq $SIZE_OF_ARGS, %rsp
        # pop everything, execute displaced instructions and jump back

I think this trampoline is pretty similar to what you're proposing, but there are a couple of subtle differences:
- the args structure is allocated on the stack - I already spoke about that
- the testing of the condition happens inside the trampoline

I think this second item has several advantages. Firstly, this means that when we hit the breakpoint, we only have one extra frame on the stack. So even if we don't do any extra work in the debugger to hide this stuff, we don't clutter the stack too much.

Secondly, this means we can avoid the "disassemble and scan for trap opcode" step, which is kind of a hack -- after all, we generated these instructions, so we should _know_ where the trap opcode is. This way, you can emit a special symbol (trap_opcode label in the example above), that lldb can then search for, and know its location exactly.

I think testing the condition inside the trampoline might be very limiting:
- The variable resolution would need to be rethought to allow the condition check to happen in the trampoline.
- To be able to support different condition types (expression / thread name / thread id …), the `$__lldb_expr` is a better option IMO. In the future, we might also inject logging code that would only be run according to the condition.
- This feature requires at least one more frame (for your approach), that would still need to be hidden from the user. I don’t think hiding 2 frames is more work than hiding 1.

I might be the one misunderstanding, but I think you missed Pavel’s point. In Pavel’s model, you still JIT the condition into `__lldb_expr` and pass it the argument structure. The difference is that you don’t have the trap inside of the JITed code; you have the JITed code return whether to stop or not and have the trampoline hit the trap depending on the return value. I agree this seems cleaner than scanning the output to find the trap.

Inserting the trap in the trampoline would still require fetching the `$__lldb_expr`'s return value (architecture-specific) and writing an assembly check statement (compare and jump).
Right now, all of this is abstracted by the UserExpression.

I do agree that it’s cleaner, and will take it into consideration for my next patches.

Fred

And lastly, and this is the most important advantage IMO, is that we are in full control of the kind of unwind info we generate for the trampoline. We can generate the proper eh_frame info for this trampoline which would correctly describe the locations of the registers of the previous frame, so that lldb would automatically be able to find them and display them properly when you do for instance "register read" with the parent frame selected. Hopefully, all this would take is a couple of well-placed .cfi assembler instructions.

Here, I'm imagining we could use the MC layer in llvm to do this thing, either by feeding it a raw assembler string, or by using its C++ API, whichever is easier. Then we could feed this to the normal jit together with the compiled C++ expression and it would link it all together and load it into memory.

### Instruction Shifter (WIP)

Because some instructions might use operands that are at an offset relative
to the program counter, copying the instructions to a new location might
change their meaning. LLDB needs to patch each such instruction with the
right offset.

This is done using the `llvm::MCInst` class in order to detect the instructions
that need to be rewritten.
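
The offset fix-up itself is simple arithmetic. The Python sketch below shows it for an x86_64 instruction with a 32-bit PC-relative displacement. The function name is hypothetical, and the sketch assumes the instruction keeps the same length at its new location.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def shifted_displacement(old_addr, new_addr, length, disp):
    """Recompute a PC-relative displacement after moving an instruction.

    On x86_64 the displacement is relative to the address of the *next*
    instruction, i.e. instruction address + instruction length.
    """
    target = old_addr + length + disp
    new_disp = target - (new_addr + length)
    if not INT32_MIN <= new_disp <= INT32_MAX:
        raise OverflowError("target is out of rel32 range from the new location")
    return new_disp
```

An `OverflowError` here corresponds to a case where the copied instruction could no longer reach its original target, so the fast breakpoint would have to be rejected or handled differently.
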

## Risk Mitigation

The optimization relies heavily on code injection, most of which is
architecture-specific. Because of this, overwriting the instructions
can fail depending on the breakpoint location, e.g.:

- If the overwritten instructions contain indirection (branch instructions).
- If the overwritten instructions are a branch target.
- If there are not enough instructions to insert the branch instruction (x86_64).

If the setup process fails to insert the Fast Conditional Breakpoint, it will
fall back to the legacy behavior, and warn the user about what went wrong.
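
The three failure cases can be expressed as one pre-flight check. The following Python sketch is an assumption-laden illustration rather than the prototype's logic: instructions are hypothetical `(address, size, is_branch)` tuples, the set of branch targets is assumed to be known, and a branch landing on the first overwritten instruction is treated as harmless because the jump to the trampoline now starts there.

```python
def check_patch_site(instructions, branch_targets, patch_size=5):
    """Return None if the site can be patched, else a reason to fall back.

    instructions: (address, size, is_branch) tuples starting at the
    breakpoint address. branch_targets: addresses known to be jumped to.
    """
    covered = 0
    for index, (address, size, is_branch) in enumerate(instructions):
        if covered >= patch_size:
            break
        if is_branch:
            return "overwritten instructions contain a branch"
        # The first instruction may be a target: the jump now lives there.
        if index > 0 and address in branch_targets:
            return "an overwritten instruction is a branch target"
        covered += size
    if covered < patch_size:
        return "not enough instruction bytes for the jump"
    return None
```

The returned string is the kind of message that could be shown to the user when falling back to a regular conditional breakpoint.
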

Another possible fallback behavior would be to still do the whole trampoline stuff and everything, but avoid needing to overwrite opcodes in the target by having the gdb stub do this work for us. So, we could teach the stub that some addresses are special and when a breakpoint at this location gets hit, it should automatically change the program counter to some other location (the address of our trampoline) and let the program continue. This way, you would only need to insert a single trap instruction, which is what we know how to do already. And I believe this would still bring a major speedup compared to the current implementation (particularly if the target is remote on a high-latency link, but even in the case of local debugging, I would expect maybe an order of magnitude faster processing of conditional breakpoints).

This would be kind of similar to the "cond_list" in the gdb-remote "Z0,addr,kind;cond_list" packet <https://sourceware.org/gdb/onlinedocs/gdb/Packets.html>.

In fact, given that this "instruction shifting" is the most unpredictable part of this whole architecture (because we don't control the contents of the inferior instructions), it might make sense to do this approach first, and then do the instruction shifting as a follow-up.

One way to mitigate those limitations would be to use code instrumentation
to detect whether it's safe to set a Fast Conditional Breakpoint at a certain
location, and hint the user to move the FCB before or after the location where
it was set originally.

## Prototype Code

I submitted my patches ([1](https://reviews.llvm.org/D66248), [2](https://reviews.llvm.org/D66249),
[3](https://reviews.llvm.org/D66250)) on Phabricator with the prototype.

## Feedback

Before moving forward I'd like to get the community's input. What do you
think about this approach? Any feedback would be greatly appreciated!

Thanks,

As my last suggestion, I would like to ask you to consider testing as you're writing this code. This is a pretty complex machinery you're building, and it would be nice if it was possible to test pieces of it in isolation instead of just the large end-to-end kinds of tests. For example, in the "instruction shifter" machinery, it would be nice to be able to take a single instruction, execute it both in place and in a "shifted" location, and assert that the resulting register contents are identical.

Will do.

regards,
pavel

Thanks,

Ismail.

_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
lldb-dev Info Page

Ismail

labath · August 20, 2019, 7:46am

Note that this does not really have to be a proper "return value". If it makes anything easier, the result can also be returned through the __lldb_expr struct similar to how we "return" values from normal expressions.

That said, I think the reason why fetching the return value by hand sounds scary is because the current process of constructing the trampoline is pretty clunky -- you need to memcpy opcodes and their arguments around by hand. If we switched to creating the trampoline by injecting file-level asm into the compiled expression, then the trampoline would become pretty much a static string embedded into the ABI plugin, and "fetching the return value" would mean writing something like "testb %al, %al" into that string.

cheers,
pl

Tamas_Berghammer · August 20, 2019, 10:30am

It is great that you are looking at supporting these fast breakpoints
but I am concerned about the instruction moving code along the same
lines Pavel mentioned. Copying instructions from 1 location to another
is fairly complicated even without considering the issue of jump
targets and jump target detection makes it even harder.

For reference, I implemented a similar system to do code shifting only
on prologue instructions using LLVM, which you might find useful for
reference at https://github.com/google/gapid/tree/master/gapii/interceptor-lib/cc
(Apache v2 license) in case you decide to go down this path.

That system doesn't try to detect jump targets and only handles a
small subset of the instructions but I think shows the general
complexity. On X86_64 I think the number of instructions that need
rewriting is relatively small as most of them aren't PC relative but
for example on ARM where (almost) any instruction can take PC as a
register it will be a monumental task that is very hard to test (I
would expect AArch64 to be somewhere between X86_64 and ARM in terms
of complexity due to PC relative instructions but no general purpose
PC register).

In my view this discussion leads to the question of how we trade
performance for accuracy/reliability. We can easily gain a lot of
performance by being a bit sloppy and assume that we can safely insert
trampolines into the middle of the function but I would want my
debugger to "never lie" or crash my program.

Tamas

jingham · August 20, 2019, 4:58pm

While this can indeed be complicated, provided we are humble about our abilities, and back out on instructions we can't handle, we won't ever break your program. We'll just sometimes fail to set fast conditional breakpoints, and fall back to slow ones. We can announce this fact, and maybe even say "would it be okay if I moved this breakpoint two instructions north, then I could support fast conditions". So either the user can allow us to auto-adjust the location or pick another one by hand.

We could even implement Pavel's suggestion as the fallback so we would still get some speedup.

But as I said earlier in the thread, we're going to have to figure out how to do something like this if we really want to have "keep alive" threads in the presence of breakpoints. Since that seems a generally desirable feature, I think it's worth the effort to try to make this work.

Jim

Pedro_Alves · August 21, 2019, 10:48pm

Hi,

Very interesting.

One comment below, about something that jumped at me when
I skimmed the proposal.

Since the x86_64 ISA has variable instruction size, LLDB moves enough
instructions in the trampoline to be able to overwrite them with a jump to the
trampoline.

If I understood you correctly, you meant to say that LLDB moves
enough instructions _at the breakpoint address_ to be able to
overwrite them with a jump to the trampoline?

It's the plural (instructionS) that jumped at me.
If so, how do you plan to handle the case of some thread currently
executing one of the instructions that you're overwriting?

Say, you're using a 5 bytes jmp instruction to jump to the
trampoline, so you need to replace 5 bytes at the breakpoint address.
But the instruction at the breakpoint address is shorter than
5 bytes. Like:

ADDR | BEFORE | AFTER

mib · August 21, 2019, 11:36pm

Hi Pedro,

Hi,

Very interesting.

One comment below, about something that jumped at me when
I skimmed the proposal.

Since the x86_64 ISA has variable instruction size, LLDB moves enough
instructions in the trampoline to be able to overwrite them with a jump to the
trampoline.

If I understood you correctly, you meant to say that LLDB moves
enough instructions _at the breakpoint address_ to be able to
overwrite them with a jump to the trampoline?

It's the plural (instructionS) that jumped at me.
If so, how do you plan to handle the case of some thread currently
executing one of the instructions that you're overwriting?

Say, you're using a 5 bytes jmp instruction to jump to the
trampoline, so you need to replace 5 bytes at the breakpoint address.
But the instruction at the breakpoint address is shorter than
5 bytes. Like:

    ADDR | BEFORE          | AFTER
    ---------------------------------------
    0000 | INSN1 (1 byte)  | JMP (5 bytes)
    0001 | INSN2 (2 bytes) |   <<< thread T's PC points here
    0002 |                 |
    0003 | INSN3 (2 bytes) |

Now once you resume execution, thread T is going to execute a bogus
instruction at ADDR 0001.

That’s a relevant point.

I haven’t thought of it, but I think this can be mitigated by checking at
the time of replacing the instructions if any thread is within the copied
instructions bounds.

If so, I’ll change all the threads' pcs that are in the critical region to
point to new copied instruction location (inside the trampoline).

This way, it won’t change the execution flow of the program.
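
A minimal Python sketch of that mitigation is shown below. It is hypothetical and rests on two assumptions: all threads are stopped while the patch is applied, and the copied instructions keep their original sizes, so an offset into the overwritten bytes is also a valid offset into the copy.

```python
def relocate_thread_pcs(thread_pcs, patch_addr, copied_size, copy_addr):
    """Move PCs that point inside the overwritten bytes into the copy.

    thread_pcs maps a thread id to its program counter. A PC exactly at
    patch_addr is left alone: it will simply execute the new jump.
    """
    relocated = {}
    for tid, pc in thread_pcs.items():
        if patch_addr < pc < patch_addr + copied_size:
            relocated[tid] = copy_addr + (pc - patch_addr)
        else:
            relocated[tid] = pc
    return relocated
```

In the example table, thread T at `0001` would be moved to the copy of INSN2 inside the trampoline, while a thread at `0000` or at `0005` would keep its PC.
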

Thanks for pointing out this issue, I’ll make sure to add a fix to my
implementation.

If you have any other suggestions on how to tackle this problem, I’d
really like to know about them :).

GDB does something similar to this for fast tracepoints (replaces
the tracepointed instruction with a jump to a trampoline area
that does the tracepoint collection, all without traps), and because
of the above, GDB currently keeps it simple and only allows setting
fast tracepoints at addresses with instructions longer than
the jump-to-trampoline jump instruction used.

Thanks,
Pedro Alves

Sincerely,

Ismail

Pedro_Alves · August 22, 2019, 12:29pm

Say, you're using a 5 bytes jmp instruction to jump to the
trampoline, so you need to replace 5 bytes at the breakpoint address.
But the instruction at the breakpoint address is shorter than
5 bytes. Like:

    ADDR | BEFORE          | AFTER
    ---------------------------------------
    0000 | INSN1 (1 byte)  | JMP (5 bytes)
    0001 | INSN2 (2 bytes) |   <<< thread T's PC points here
    0002 |                 |
    0003 | INSN3 (2 bytes) |

Now once you resume execution, thread T is going to execute a bogus
instruction at ADDR 0001.

That’s a relevant point.

I haven’t thought of it, but I think this can be mitigated by checking at
the time of replacing the instructions if any thread is within the copied
instructions bounds.

If so, I’ll change all the threads' pcs that are in the critical region to
point to new copied instruction location (inside the trampoline).

This way, it won’t change the execution flow of the program.

Yes, I think that would work, assuming that you can stop all threads,
or all threads are already stopped, which I believe is true with
LLDB currently. If any thread is running (like in gdb's non-stop mode)
then you can't do that, of course.

Thanks for pointing out this issue, I’ll make sure to add a fix to my
implementation.

If you have any other suggestions on how to tackle this problem, I’d
really like to know about them :).

Not off hand. I think I'd take a look at Dyninst, see if they have
some sophisticated way to handle this scenario.

Thanks,
Pedro Alves

clayborg · August 22, 2019, 10:35pm

Another possibility is to have the IDE insert NOP opcodes for you when you write a breakpoint with a condition and compile NOPs into your program.

So the flow is:
- set a breakpoint in IDE
- modify breakpoint to add a condition
- compile and debug, the IDE inserts NOP instructions at the right places
- now when you debug you have a NOP you can use and not have to worry about moving instructions

mib · August 22, 2019, 10:58pm

Hi Greg,

Thanks for your suggestion!

Another possibility is to have the IDE insert NOP opcodes for you when you write a breakpoint with a condition and compile NOPs into your program.

So the flow is:
- set a breakpoint in IDE
- modify breakpoint to add a condition
- compile and debug, the IDE inserts NOP instructions at the right places

We’re trying to avoid rebuilding every time we want to debug, but I’ll keep
this in mind as an eventual fallback.

- now when you debug you have a NOP you can use and not have to worry about moving instructions

Sincerely,

Ismail
