Problem unwinding from inside of a CRT function

In practice how slow is this for stepping over very lengthy calls? It sounds like LLDB still generates a call stack at every branch point, which while certainly faster than generating one at every instruction, still seems like it has the potential to be very slow. I still wonder about an algorithm such as one that differentiates between local control flow instructions (jmp, jnz, etc) and non-local control flow instructions (call). Single step the the local control flow instructions, and put a breakpoint at the return address of the non-local control flow instructions. This requires introducing some architecture specific dependencies into the ThreadPlan, but it doesn’t seem much more than those that are already in the thread plan. For example, the ThreadPlan currently has to scan forward to the next branch instruction to put the breakpoint. So this only adds the ability to differentiate between different types of branch instructions (e.g. those that transfer-and-forget, and those that transfer-and-return. Seems like this yields near zero-overhead in terms of executing the step-over.

As with last time, I’m probably missing something, and I’m mostly just thinking out loud :slight_smile:

In practice how slow is this for stepping over very lengthy calls? It sounds like LLDB still generates a call stack at every branch point, which while certainly faster than generating one at every instruction, still seems like it has the potential to be very slow.

At most, lldb only needs to get the current frame & its parent frame to run its stepping algorithm. It won't generate a full stack frame unless you ask it to. Getting two frames should be pretty quick.

I still wonder about an algorithm such as one that differentiates between local control flow instructions (jmp, jnz, etc) and non-local control flow instructions (call). Single step the the local control flow instructions, and put a breakpoint at the return address of the non-local control flow instructions. This requires introducing some architecture specific dependencies into the ThreadPlan, but it doesn't seem much more than those that are already in the thread plan. For example, the ThreadPlan currently has to scan forward to the next branch instruction to put the breakpoint. So this only adds the ability to differentiate between different types of branch instructions (e.g. those that transfer-and-forget, and those that transfer-and-return. Seems like this yields near zero-overhead in terms of executing the step-over.

So one thing to note: I implemented the "run from branch instruction to branch instruction" for two reasons, one was possible performance improvements, and the other because many architectures have some instructions that you can't reliably single step over (as you mentioned earlier.) So I wanted lldb to single-step as little as possible. The second goal was achieved by definition. But I tested the performance improvements for Mac native debugging and for iOS debugging, and while there was some speedup, for the things we cared about, the speedup was quite small, pretty much in the noise. And here I was saving you many round-trips since most code does a fair bit of work between branches. At best your suggestion would save one round-trip.

As Jason says, we spent a fair amount of effort on optimizing gdb-remote packet sequences. And for native process plugins like Windows etc. there's no excuse for these primitive operations to be slow. And computers are pretty fast these days. Maybe if you are talking to some really slow device this might start to be an issue. So I don't think that performance is a good reason to complicate the stepping algorithms in this way.

If you can get increased stepping accuracy with more predictive approaches, then that might be worth the effort. However, one of the nice things about the current method is if where you ended up confuses you, you can stop at that point & return control to the user. The more lldb does "predict where the code is going to go, set a breakpoint there and continue" the more it risks "step -> continue". That is much worse from a user's perspective than step-over -> step-in.

Jim

I'm on Hexagon; we're using the default unwinder set up in source/Plugins/ABI/SysV-hexagon/ABISysV_hexagon.cpp.

Dealing with a hardware issue on some devices will mean I need to check for certain hardware revisions and modify some of the data read from the stack. I'd like to intercept a read, check for the problematic HW, and modify the data.

Ted

Specifically -- it will take three packets. The initial stop packet gives us enough registers to know what the current stack frame is. lldb will make one memory read of the stack memory, which gives it enough to reconstruct the caller stack frame. T packet, memory read, memory read result. Then lldb knows whether to continue stepping, set a breakpoint & resume, or stop stepping.

"The initial stop packet gives us enough registers to know what the current
stack frame is."

Which registers would you suggest I have my simulator return in the stop
packet?

See if you can expedite all of your GPR registers in your stop reply and qThreadStopInfo responses. We are typically slowed down by packet send/receive, so we try to minimize them by always including all registers in the stop reply and qThreadStopInfo. Otherwise we will send a p packet to read each register we need.

Below is an example from debugging TextEdit.app on MacOSX:

< 5> send packet: $c#63
send packet: \x03
< 587> read packet: $T11thread:3fe26e;qaddr:7fff77e5c480;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:0540001000000000;01:ffffffff00000000;02:28e1bf5fff7f0000;03:0000000000000000;04:90e2bf5fff7f0000;05:0608000700000000;06:70e1bf5fff7f0000;07:28e1bf5fff7f0000;08:0b1b000000000000;09:ffffffff00000000;0a:000c000000000000;0b:0602000000000000;0c:000c000000000000;0d:0000000000000000;0e:90e2bf5fff7f0000;0f:0b1b000000000000;10:2e351a88ff7f0000;11:0602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;metype:5;mecount:2;medata:10003;medata:11;#00
SendInterrupt () - sent interrupt, private state stopped
< 25> send packet: $qThreadStopInfo3fe292#96
< 542> read packet: $T00thread:3fe292;qaddr:10009c180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7101000200000000;01:0000000000000000;02:b8b5090001000000;03:0000000000000000;04:0300000000000000;05:0000000000000000;06:40b6090001000000;07:b8b5090001000000;08:0100000000000000;09:0000000000000000;0a:d0b5090001000000;0b:4602000000000000;0c:5892398bff7f0000;0d:c04e2077ff7f0000;0e:d0b5090001000000;0f:00b6090001000000;10:2e921a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe293#97
< 542> read packet: $T00thread:3fe293;qaddr:100581180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000010000000000;02:180f580001000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:500f580001000000;07:180f580001000000;08:0000000000000000;09:0000000000000000;0a:0000000000000000;0b:4602000000000000;0c:ff10008000000000;0d:0316000000000000;0e:0010580001000000;0f:1900000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe294#98
< 542> read packet: $T00thread:3fe294;qaddr:100604180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:0500000200000000;01:f82d600001000000;02:f81c600001000000;03:d8490a0000000000;04:d8498a0901000000;05:0480000000000000;06:d022600001000000;07:f81c600001000000;08:0900000000000000;09:2006000000000000;0a:d8498a0901000000;0b:4602000000000000;0c:0000000000000000;0d:0000000000000000;0e:2800000000000000;0f:2026600001000000;10:c67f1a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe2a3#bf
< 542> read packet: $T00thread:3fe2a3;qaddr:100687180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000000000000000;02:186f680001000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:506f680001000000;07:186f680001000000;08:ff1000a000000000;09:0000000000000000;0a:0000000000000000;0b:4602000000000000;0c:ff08000000000000;0d:0717000000000000;0e:0070680001000000;0f:1500000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe2b4#c1
< 542> read packet: $T00thread:3fe2b4;qaddr:1056f5180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000010000000000;02:184f6f0501000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:504f6f0501000000;07:184f6f0501000000;08:0200000000000000;09:0000200001000000;0a:0000000000000000;0b:4602000000000000;0c:ff08008000000000;0d:1333000000000000;0e:00506f0501000000;0f:1500000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe2c3#c1
< 542> read packet: $T00thread:3fe2c3;qaddr:107e81180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000010000000000;02:180fe80701000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:500fe80701000000;07:180fe80701000000;08:de182f88ff7f0000;09:0000300001000000;0a:0000000000000000;0b:4602000000000000;0c:ff20008000000000;0d:0b72000000000000;0e:0010e80701000000;0f:2100000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe2d2#c1
< 542> read packet: $T00thread:3fe2d2;qaddr:109e81180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000010000000000;02:180fe80901000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:500fe80901000000;07:180fe80901000000;08:0100000000000000;09:0000000000000000;0a:0000000000000000;0b:4602000000000000;0c:ff08008000000000;0d:1b50000000000000;0e:0010e80901000000;0f:1500000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe2f3#c4
< 542> read packet: $T00thread:3fe2f3;qaddr:109f84180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000000000000000;02:183ff80901000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:503ff80901000000;07:183ff80901000000;08:ff1000a000000000;09:0000000000000000;0a:0000000000000000;0b:4602000000000000;0c:ff08000000000000;0d:0380000000000000;0e:0040f80901000000;0f:1500000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe56a#c5
< 542> read packet: $T00thread:3fe56a;qaddr:10a703180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:7001000200000000;01:0000000000000000;02:182f700a01000000;03:0000000000000000;04:0400000000000000;05:0000000000000000;06:502f700a01000000;07:182f700a01000000;08:1b00000000000000;09:0000400001000000;0a:0000000000000000;0b:4602000000000000;0c:ff08000000000000;0d:2f97000000000000;0e:0030700a01000000;0f:1500000000000000;10:46891a88ff7f0000;11:4602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00
< 25> send packet: $qThreadStopInfo3fe56d#c8
< 542> read packet: $T00thread:3fe56d;qaddr:10a7a3180;threads:3fe26e,3fe292,3fe293,3fe294,3fe2a3,3fe2b4,3fe2c3,3fe2d2,3fe2f3,3fe56a,3fe56d;00:1f00000100000000;01:ffffffff00000000;02:481f7a0a01000000;03:0000000000000000;04:b0207a0a01000000;05:0608000700000000;06:901f7a0a01000000;07:481f7a0a01000000;08:039d000000000000;09:ffffffff00000000;0a:000c000000000000;0b:0602000000000000;0c:000c000000000000;0d:0000000000000000;0e:b0207a0a01000000;0f:039d000000000000;10:2e351a88ff7f0000;11:0602000000000000;12:0700000000000000;13:0000000000000000;14:0000000000000000;#00

Above you will see "XX:YYYYYYYYYYYY" where XX are two hex digits for the register number followed by a : and the register bytes. Note these register bytes are in native endian format, not big endian.

Not we also list all of the process' threads using the "threads:a,b,c,d;" format in each stop reply. It doesn't take up much space and it also stops your from having to make two packets round trip calls using qfThreadInfo and qsThreadInfo.

Greg

BTW, Zachary's question here was not a dumb one. A full stack backtrace is not cheap, so when hacking on lldb, never ask for more frames than you need. It isn't as expensive as it was in gdb since we separate unwinding the pc/cfa from unwinding the arguments up the stack, but still, a full stack trace should only get done in response to some high level user request.

Jim

Jim & Jason,
Thanks for explaining this in details. I also like the ability provided
by functions like QueueThreadPlanForStepOverRange. You can experiment
by providing your own step plan if you want in your process plugin.

Regards,
Abid

Regarding the function bounds, I thought about this some, and I’m not sure if this is going to be possible. Consider a system library, like the CRT, being linked against with no symbol information. Where are the function bounds going to coem from? There’s nothing in the symbol table of the COFF file, and there’s no debug info. And since we’re talking about an x86 binary, unwind info is not part of the ABI. There’s just a huge block of code in the code section. Even if we do have symbols (which is how we would determine function bounds for code we have no control over), we will only have “public symbols” released by Microsoft. Public symbols do not consist of information about every function, and practically any non-trivial call is going to at some point transfer control to one of the private functions that have no symbol information.

For generic backtracing, that's going to be a problem. For instance, if you want to interrupt & do a backtrace at some random point, you may get a frame wrong. If the MS unwind library gets this right, by all means use that as a host function. If they get that right, I'd be interested to see how they do that (though we probably will never know...) After all, just doing "find the instruction just before from the current pc" is not entirely trivial on x86...

Anyway, for "step-over" and so on unexported functions will be less of a problem in practice, since for source level stepping, you are generally stepping from your user code to some other routine. So the library routine you are stepping into has pretty much got to be an exported routine. And since source level stepping doesn't by default stop in code with no debug info (though this is settable) you're just going to get out of there...

Now if you "step-into" and stop in system routines, and then try to use step-over-inst through the instructions in that routine, you may get into some more trouble.

That is the sort of situation where being able to understand the caller instruction(s) could help out. But step-over-inst (gdb's nexti) in code without symbols has generally been a bit of a crap shoot both in lldb & gdb. Dunno how well the Windows debugger does with this.

Jim

It’s funny you should mention that, because one of my favorite commands on Windows is “disassemble backwards”. I actually wanted to implement this in LLDB eventually. But I haven’t spent time to figure out how Windbg handles reverse disassembling with no symbol information, or if it works at all.

The simple way to do this is to go to the first symbol north of your current address, disassemble forward till you get to the current PC, then display the last N instructions in that disassembly (maybe caching the disassembly for extra credit.)

There are probably other tricky ways to do this if you know lots about x86, but that's not my area of expertise.

Jim

There was an old framework on Mac OS X that would find function bounds in an unknown code base (back before we had function start addresses) -- on x86 it would do as Jim suggests, scan backwards, look for a typical function prologue byte sequence (push rbp; mov rsp, rbp) and then disassemble forward again to see if it got to the same address.

Any backwards scanning algorithm is surely going to be looking for a commonly-occuring sequence of instructions. The function prologue is as good as any, I wouldn't be surprised if the MS disassembler did the same thing.

We find ways in MachO that don't involve symbols:
- EH frame
- LC_FUNCTION_STARTS
- dynamic linker stub info we parse
- dynamic linker trie information for resolver symbols

None of these are symbol table based and yet we can find function bounds.

From:

I see:

Program Exception Data
      Some architectures (including the IA-64) don't use frame-based exception handling, like the x86 does; instead, they used table-based exception handling in which there is a table containing information about every function that might be affected by exception unwinding. The data for each function includes the starting address, the ending address, and information about how and where the exception should be handled. When an exception occurs, the system searches through the tables to locate the appropriate entry and handles it. The exception table is an array of IMAGE_RUNTIME_FUNCTION_ENTRY structures. The array is pointed to by the IMAGE_DIRECTORY_ENTRY_EXCEPTION entry in the DataDirectory. The format of the IMAGE_RUNTIME_FUNCTION_ENTRY structure varies from architecture to architecture. For the IA-64, the layout looks like this:
DWORD BeginAddress;
DWORD EndAddress;
DWORD UnwindInfoAddress;

Not sure if it covers all functions, but it does sound like it would at least help a little.

If there really is no way to figure this out, we might need to resort to looking for function bounds by disassembling and watching for callq instructions to look for internal calls. We should be able to identify all external symbols via the exports.

Greg

I’m going to be revisiting this soon, but one thing I was never clear about.

If I step over a function call, does it do this algorithm of single stepping a call instruction and then running until the next branch point all the way down, or does it only do it one level deep? In other words, say I have this code

void baz() {
printf(“Test”);
}

void bar() {
baz();
}

void foo() {
bar();
}

and I’m inside of foo(), and I want to step over bar. It will single step the call, end up inside of bar. Then run to the next branch point. Does it now single step baz, end up in baz, run to next branch point, and then single step printf, and then continue this all the way down? Or once it figures out where it is inside of bar, that’s sufficient to let it run until the return address?

If it’s the former, where it does this all the way down, what if some function deep down in the call chain is a hand-written assembly function that uses a custom calling convention?

The latter. It will stop at the first instruction of "bar()" and then set a BP on the return address and continue to it, then continue with the source level single step over you started with (if there are any instructions left after the return BP is hit.

Ok crisis averted. Part of why this thread dragged out for so long is because I kept thinking it did this all the way down and it seemed unnecessary and problematic to allow code arbitrarily deep on the stack to mess up unwinding.

Thanks.

Yeah, it would be kind of silly to recurse all the way down before stepping all the way back out. When you are stepping IN, you have to at least get all the way to the target function - stepping through trampolines - because otherwise it is hard to tell whether the function you will eventually land in has debug info or not. But for step OVER you can just get right out of anything you've stepped into.

Jim