Problem unwinding from inside of a CRT function

It’s not that I don’t trust that you know what you’re doing. I’m trying to understand it for myself, so that I can make the necessary changes to get it working on Windows.

On the other hand, Windows does have plenty of different rules and assumptions when it comes to how it generates code. So while I’m confident that you guys have thought about it and made something that works well within the context of non-Windows, it’s a given that some of the functionality is going to be broken when the assumptions change. Not because you didn’t know what you were doing, but because the code wasn’t designed with these assumptions in mind. Otherwise it would have just already worked and this thread would have never been created :slight_smile:

So it’s not enough for me to just say “I have to trust that it works, therefore I must be doing something wrong”. I also need to understand the architecture and the details well enough to be able to figure out if it doesn’t work because of some fundamental differences in Windows vs. non-Windows, or if it doesn’t work because I have a bug somewhere in my process plugin, or something else.

One of the questions I keep asking myself is: Why, when using one of the native Windows debuggers such as MSVC or WinDbg, if I step over a call, does it always work even if the called function has no debug info, no unwind info, and uses FPO?

Maybe it’s related to them having better function bounds in their COFF parser, as you suggested. I’m going to look into that, but I just want to re-emphasize that none of my posts, including this one, are intended to second guess anyone’s decisions. But at the same time it’s only natural to expect that since many of the assumptions were made without Windows in mind, they may prove to be slightly wrong.

There aren’t very many detailed design documents about how things work, and in certain areas code documentation is sparse. So the purpose of me asking is simply to understand how it works.

Hope this makes sense.

Let me echo what Vince is saying; I’m about to dig into the unwinder and currently have no idea how it works. The conversation in this thread has helped a ton.

We’re currently using the default unwinder, but I need to add some special case stuff. Any thoughts on where to look to see a custom implementation?

This is Jim's area so I should let him reply. Jim described earlier how when lldb is instruction stepping over an address range, it uses the disassembler to identify instructions that may branch and uses breakpoints to execute between those (so we don't need to single instruction step the entire range). This is a relatively new feature to the stepper code -- maybe added within the last year -- and it's the first time that the ThreadPlan type algorithms had access to that knowledge.

Now that we have a disassembler at our disposal, it is reasonable to ask if the disassembler could flag function call instructions and the ThreadPlan could know to single instruction step and provide a hint to the unwinder "Hey, we just stepped into a new function, we haven't executed any instructions in it yet".

The ABI provides a special unwind plan for exactly these scenarios -- CreateFunctionEntryUnwindPlan() -- so all the pieces are likely available. It's a matter of plumbing it all together across the layers.

The reason this hasn't been necessary to date is that, on all of the platforms lldb operates on, it has the addresses of all the functions and stubs/trampolines (CRT functions, PLT routines) in the binaries, or it has unwind information (eh_frame instructions) that tells it how to unwind from address ranges even if it doesn't necessarily know the start address of the functions.

This makes the impetus to add call/bl knowledge to the ThreadPlans a lot less important - it's not fixing a problem that any of us are seeing today.

And even if we do the work of identifying the "step into a new function" sequence during stepping, we STILL need to be able to unwind from arbitrary stop locations in your process accurately. You may interrupt the process at any instruction location -- or the program may crash at nearly any instruction location -- and you need to be able to backtrace out of there. Even if there are functions in the middle of the stack that don't use the frame pointer. Even if you're sitting at the first instruction of a CRT function or you're in the middle of a frameless leaf function.

In my opinion, expending a lot of energy on making the ThreadPlans know how to unwind from the first instruction is ignoring the real problem of being able to unwind accurately from all instruction locations. It's not worth doing. Make the unwinder work from any location on your platform - if it can't, that's the problem that needs to be fixed. I agree it would be interesting if the ThreadPlans could tell the Unwinder that it has just stepped into a function, for even better reliability in a particularly tricky unwind scenario. But it's not a panacea: if that's the only thing you fix and you rely on "walk the frame chain on the stack" to backtrace, you're going to have a horrible debugger experience. Even if you can accurately walk the stack you won't get register save locations, for instance, so when the debug info says a variable is stored in rbx in the middle of the stack, and rbx was saved to the stack by a callee function, you won't be able to retrieve it for the user.

J

Ted, what kind of special stuff are you looking to add? If you're on an i386/x86_64/armv7/arm64 architecture system, lldb's unwinder should be doing a good job as long as you have function start addresses or eh_frame unwind instructions.

The Unwind class is the top level one that is asked "Hey can you give me another stack frame".

The RegisterContext class is the one that provides register values for a stack frame. Frame 0 has a live register context -- you get the current reg values from the cpu -- but above frame 0 it's all about retrieving values from the stack.

Unwind creates a stack frame and associates it with a RegisterContext, then returns that StackFrame to the thread.

The standard unwinder for x86/arm/arm64 is UnwindLLDB and RegisterContextLLDB.
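The division of labor described above can be sketched as a toy model. This is not lldb's actual code: `ToyUnwinder`, `ToyRegisterContext`, and `ToyStackFrame` are hypothetical names, and the saved pc values are simply handed to the unwinder, where a real one would recover them from stack memory using unwind plans.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical register context: frame 0 is "live" (values come from the
// CPU); older frames hold values recovered from the stack.
struct ToyRegisterContext {
    std::size_t frame_index;
    std::uint64_t pc;
    bool IsLive() const { return frame_index == 0; }
};

// A stack frame is paired with the register context that describes it.
struct ToyStackFrame {
    std::size_t index;
    std::shared_ptr<ToyRegisterContext> reg_ctx;
};

// Hypothetical top-level unwinder: answers "can you give me another frame?"
class ToyUnwinder {
public:
    // live_pc: pc read from the CPU. saved_pcs: return addresses for frames
    // 1..N, supplied directly here (an assumption of this sketch).
    ToyUnwinder(std::uint64_t live_pc, std::vector<std::uint64_t> saved_pcs)
        : m_live_pc(live_pc), m_saved_pcs(std::move(saved_pcs)) {}

    // Returns nullptr once the caller walks past the oldest frame.
    std::shared_ptr<ToyStackFrame> GetFrameAtIndex(std::size_t idx) const {
        if (idx > m_saved_pcs.size())
            return nullptr;
        auto ctx = std::make_shared<ToyRegisterContext>();
        ctx->frame_index = idx;
        ctx->pc = (idx == 0) ? m_live_pc : m_saved_pcs[idx - 1];
        auto frame = std::make_shared<ToyStackFrame>();
        frame->index = idx;
        frame->reg_ctx = ctx;
        assert(frame->reg_ctx->IsLive() == (idx == 0));
        return frame;
    }

private:
    std::uint64_t m_live_pc;
    std::vector<std::uint64_t> m_saved_pcs;
};
```

A thread-like caller would ask for frame 0, then frame 1, and so on until it gets `nullptr` back.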

J

1) In the original implementation, (and this is how gdb does it, BTW) lldb single-stepped till "something interesting happened." As an optimization, when you are doing any kind of step through source range, I changed lldb so it runs from "instruction that could branch" to "instruction that could branch" using breakpoints. Then when it hits an instruction that could branch it single steps that instruction, and then figures out from where that went what to do next.

Nice.

BTW, if it were helpful to figure out what to do next, we could store some info (the old stack frame or whatever) when we hit a branch instruction, and then use it when the single-step completed. I haven't needed to do that yet, however; Jason's always been able to get the unwinder to work reliably enough not to require this.

First, we should definitely teach the Windows unwinder to fall back to frame pointers if no debug info is present. That's an obvious win.

Yes.

However, there are lots of environments (not just Windows) where unwinding is unreliable due to third party libraries, so it'd be nice if we can get by without unwinding.

2) If the single step pushes a frame, and we are "stepping over", lldb sets a breakpoint on the return address and continues. When the return address is hit (for the current frame of course since it could be hit recursively) then we continue stepping as above.
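The "for the current frame" check in 2) can be sketched as below. This is only an illustration under stated assumptions: the stack grows downward, the frame is identified by the stack pointer the caller will have once the callee returns, and `ReturnBreakpoint` and `IsReturnToSteppingFrame` are hypothetical names rather than lldb APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of the breakpoint set on the return address.
struct ReturnBreakpoint {
    std::uint64_t return_addr;  // address the callee returns to
    std::uint64_t caller_sp;    // sp the stepping frame has after the return
};

// True only when the stop is the return into the frame we are stepping in.
// A recursive invocation can hit the same return address, but on a
// downward-growing stack it does so with a lower stack pointer.
bool IsReturnToSteppingFrame(const ReturnBreakpoint& bp,
                             std::uint64_t pc, std::uint64_t sp) {
    if (pc != bp.return_addr)
        return false;  // stopped somewhere else entirely
    if (sp < bp.caller_sp)
        return false;  // deeper (recursive) frame hit the breakpoint
    return true;       // back in the original frame: resume range stepping
}
```

When this returns false for a hit on the return address, the stepper would simply continue the process and wait for the next hit.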

Any objection to asking the target if the previous opcode is something typically used for a call (x86 call, ARM bl), single step, and then load the retaddr or link register? Is that hard to thread through? I suppose it would fire on 32-bit x86 PIC sequences (call 0 ; pop %ebx), but that won't hurt.
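A rough sketch of the "load the retaddr or link register" step follows. It relies on two well-known conventions: an x86 `call` pushes the return address, so it sits at the top of the stack right after the step, while ARM `bl` leaves it in the link register. The `ToyThreadState` type and its word-addressed memory map are hypothetical stand-ins for real register and memory reads.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

enum class Arch { X86_64, ARM };

// Hypothetical snapshot of a thread taken right after single-stepping a call.
struct ToyThreadState {
    Arch arch;
    std::uint64_t sp;
    std::uint64_t lr;                               // meaningful on ARM only
    std::map<std::uint64_t, std::uint64_t> memory;  // toy word-addressed memory
};

// Precondition (assumed, not checked): the instruction just stepped was a
// call (x86 call / ARM bl). Returns false if the return address is unreadable.
bool ReturnAddressAfterCall(const ToyThreadState& t, std::uint64_t& ret_addr) {
    switch (t.arch) {
    case Arch::ARM:
        ret_addr = t.lr;  // bl stored the return address in lr
        return true;
    case Arch::X86_64: {
        auto it = t.memory.find(t.sp);  // call pushed it; it is now at [sp]
        if (it == t.memory.end())
            return false;
        ret_addr = it->second;
        return true;
    }
    }
    return false;
}
```

For the 32-bit PIC sequence mentioned above (call 0 ; pop %ebx), this would report the address of the `pop` as a return address, which matches the "it would fire, but that won't hurt" observation.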

The agents that manage stepping, etc. (ThreadPlans) are persistent throughout the operation they manage. They live in a little stack of operations, so there's always one entity per step type operation (e.g. if you next, hit a breakpoint, call a function, hit a breakpoint in the function, next again, etc., each of these operations has a ThreadPlan to govern it). It wouldn't be hard to store some data in them, and use it either to convey hints to the unwinder when you step in, as Jason suggests, or whatever you needed to do.

One thing, the ThreadPlans are currently architecture agnostic, and I'd like to keep it that way, so whatever you need to express should be done in an architecture-independent way. That might add a little complexity, but one I think will be worthwhile long term.

But I agree with Jason that, from a resource allocation standpoint, time spent on the unwinder would benefit the whole system much more than time spent hacking around its deficiencies in the ThreadPlans.

Jim

I think the first thing for us to try on the Windows side is to provide a simple implementation of an ABI plugin. Earlier Jason mentioned that if you’re at the start of a function LLDB will try to create a default unwinder by going through the ABI plugin, and that isn’t happening for us on Windows, presumably because something somewhere is (correctly) objecting to the use of a SystemV ABI plugin on Windows. So my first step will be to figure that out, make sure UnwindLLDB can create a default unwinder through my MicrosoftABI plugin, and then see how far that gets us.

Sure. I think we take a lot of these complexities for granted at this point. It's fun to be reminded from time to time that this stuff is a little above flipping burgers...

None of us here have worked on MSVC, so I don't know if they use fundamentally different algorithms for stepping. But there really aren't that many different ways you could do it, at least that I can think of. If I had to guess, I would bet they make their unwinder really smart, because there are so many clients of that, and if that works well, then stepping is pretty straight-forward. And then given the age of the tools they probably have a bunch of special cases layered on top to make things work beautifully.

Jim

I don't think it's practical to expect the unwinder to *always* work, but I
agree it needs to work most of the time. There are situations when unwind
data just isn't available, like when trying to step over a call to a
function JITed by something you don't control.

Personally, I think it will be a lot more work to make the unwinder
understand PDB information than it will to let ThreadPlan know what a
call instruction looks like. Clang will also be generating DWARF for the
foreseeable future, so adding PDB reading only helps us debug third party
code, which is not very interesting. Adding the call-recognition code to
ThreadPlan will solve the cross-platform problem of stepping over a call to
a function with no CFI. Would you be OK with patches in that direction?

We do eventually want to use PDBs to support unwinding, but we expected to
get by with just DWARF for some time to come.

BTW, if anybody wants to try this, this might be a good setup: Have some UnwindHint class, that you would create from a thread and the ABI plugin when you are sitting on a branch. That would fill in whatever might help it to unwind after the step. For instance, if an x86 ABI saw it was sitting on a call instruction, it could set "is_first_instruction" in the UnwindHint class. Then when the thread plan stopped it could get the new stack, passing the unwinder back the UnwindHint object it got.

Something like that might be good.
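One possible shape for that idea is sketched here. Everything in it is hypothetical: `UnwindHint` exists only as a proposal in this thread, and `ToyX86ABI`, `InsnKind`, and `MakeUnwindHint` are invented for the illustration. The point is that the architecture-specific judgment ("this is a call, so the next stop is a function's first instruction") lives in the ABI, and the ThreadPlan only carries the result around.

```cpp
#include <cassert>

// Hypothetical, architecture-neutral classification of the instruction at pc.
enum class InsnKind { Other, Jump, Call };

// Hypothetical hint filled in before the step and handed to the unwinder after.
struct UnwindHint {
    bool is_first_instruction = false;
};

// Hypothetical ABI plugin: decides what the hint should say for this target.
class ToyX86ABI {
public:
    UnwindHint MakeUnwindHint(InsnKind kind_at_pc) const {
        UnwindHint hint;
        // Stepping a call lands on the first instruction of the callee,
        // before any of its prologue has run.
        hint.is_first_instruction = (kind_at_pc == InsnKind::Call);
        return hint;
    }
};
```

A ThreadPlan sitting on a branch would ask the ABI for a hint, step one instruction, and then pass the hint along when it requests the new stack, without ever looking inside it.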

Jim

We crossed paths... See the "UnwindHint" notion I sent in another mail on this thread. I like that approach better than having the ThreadPlans know about calls, because passing the hint to the unwinder would not only get the step-in to stop as it should, but would also set the unwinder on the right path so that the backtrace at this point would be correct... That's important for the user when we do step-in rather than step-over.

Jim

Cool, that seems reasonable.

One thing to be careful about if you do this is that there's no guarantee that the ThreadPlan will be the first person to ask for the unwind, so there might be a bad unwind sitting around. It would actually be a shame if anybody else got their hands on a bad unwind when we knew how to make a better one. So maybe the UnwindHint should hang off the Thread instead, so anybody who goes to unwind will see it.

Jim

The details will be a little tricky on this one. Frame 0 is probably created as soon as the process stops - before the ThreadPlan does anything -- and we need to install the UnwindHint in that frame 0 RegisterContextLLDB before we start asking anything about frame 1.

StackFrameList is where we call the Unwinder to get stack frames -- e.g. see StackFrameList::GetFrameAtIndex().

If the ThreadPlan had a chance to execute code before we try to retrieve stack frame 1 (which it seems like it would -- this is a private stop event and the ThreadPlan is the guy who is going to ask about stack frame 1 itself), we'd need some way to say give me the RegisterContext for frame 0 and stuff the UnwindHints in that object.

e.g. see how RegisterContextLLDB::InitializeZerothFrame() sets its m_full_unwind_plan_sp. This is the UnwindPlan that will be used to find frame 1's pc, stack pointer, frame pointer, etc, when requested.

J

So the steps I was thinking of were:

1) ThreadPlan notices it is on a branch
2) ThreadPlan calls Thread::SetUnwindHint() which calls into the ABI to write down whatever hints it thinks are appropriate and stuff them into the thread.
3) ThreadPlan causes the process to step one instruction
4) Whoever goes to create the RegisterContexts & Frames for that thread will have access to the Thread, and thus to the UnwindHint. So the ThreadPlan doesn't have to get involved.

I think that should work.
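The four steps above might look something like the following, with the hint hanging off the thread so that whoever unwinds first sees it. `Thread::SetUnwindHint()` is only proposed in this thread, so `ToyThread` and its methods are hypothetical, and the rule that a hint applies only to the stop right after the step is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical hint, as proposed earlier in the thread.
struct UnwindHint {
    bool is_first_instruction;
};

// Hypothetical thread object that owns the hint.
class ToyThread {
public:
    // Step 2: record the hint before the step is issued.
    void SetUnwindHint(const UnwindHint& hint) {
        m_pending = hint;
        m_has_pending = true;
    }

    // Step 3: stepping one instruction makes the pending hint current for
    // the stop that follows, and drops any hint left from an older stop.
    void StepInstruction() {
        m_active = m_pending;
        m_has_active = m_has_pending;
        m_has_pending = false;
    }

    // Step 4: anyone building frames for this thread can ask for the hint.
    // Returns nullptr when no hint applies to the current stop.
    const UnwindHint* GetUnwindHint() const {
        return m_has_active ? &m_active : nullptr;
    }

private:
    UnwindHint m_pending{false};
    UnwindHint m_active{false};
    bool m_has_pending = false;
    bool m_has_active = false;
};
```

Because the hint is consumed by the thread rather than by the ThreadPlan, a backtrace requested by some other client at the same stop would see the same hint.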

Jim

Yeah, good point. In that case, you could just update RegisterContextLLDB::SavedLocationForRegister(). Add a little code near the top that looks to see if the thread has an UnwindHint saying we've just stepped in and this is frame 0; if so, we use the CreateFunctionEntryUnwindPlan() unwind plan instead of whatever plan we would normally use here.
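To see why the choice of plan matters, here is a small x86_64 sketch. It assumes the usual conventions: at a function's first instruction the return address is at [rsp], while after a standard frame-pointer prologue it is at [rbp+8]. The names `PlanKind`, `ChoosePlan`, and `CallerPC` are hypothetical and the memory is a toy map; the real logic would live in RegisterContextLLDB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

enum class PlanKind { FunctionEntry, FramePointer };

struct Regs {
    std::uint64_t rsp;
    std::uint64_t rbp;
};

// Use the function-entry plan only for frame 0 right after a step-in.
PlanKind ChoosePlan(std::size_t frame_index, bool just_stepped_in) {
    return (frame_index == 0 && just_stepped_in) ? PlanKind::FunctionEntry
                                                 : PlanKind::FramePointer;
}

// Compute the caller's pc under each plan:
//   FunctionEntry: CFA = rsp + 8,  return address at CFA - 8
//   FramePointer:  CFA = rbp + 16, return address at CFA - 8
bool CallerPC(PlanKind plan, const Regs& regs,
              const std::map<std::uint64_t, std::uint64_t>& memory,
              std::uint64_t& caller_pc) {
    const std::uint64_t cfa =
        (plan == PlanKind::FunctionEntry) ? regs.rsp + 8 : regs.rbp + 16;
    auto it = memory.find(cfa - 8);
    if (it == memory.end())
        return false;
    caller_pc = it->second;
    return true;
}
```

At the first instruction of a callee, rbp still belongs to the caller, so the frame-pointer plan reads the caller's own return address and silently skips a frame, while the function-entry plan finds the right one at [rsp].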

I would vote to try and get Windows stack backtracing working using all built-in code, without having to resort to the Windows native DLL which aids in backtracing. This would allow remote debugging to work seamlessly, rather than giving a really bad experience when remote debugging to Windows while local debugging is just fine. We can probably use the native DLL to help us track down cases that we get wrong, and it can help us train our Windows backtracer to make sure it works as well as possible.

Although I would be ok with having support for Windows PDB files in a new SymbolFileWindowsPDB only happen on windows because the DLL is the only way to access the debug info and the file format is not documented...

Greg

Hopefully making the backtracer smart enough on Windows will be sufficient. Either way, it seems like the first step, so we’ll see. I have some ideas about cases that might be difficult to handle, but we can get into it more after the backtracer works well enough on Windows to see how much of a problem it will be.

In a proprietary debugger that we developed in house, we spent quite a
bit of effort on making this work, mixing emulation and symbol
information. It made a real improvement when debugging remote
targets using slow connections. There was always the fallback of
stepping individual instructions when it did not work.

GDB also has a range stepping feature now.
https://sourceware.org/ml/gdb-patches/2013-03/msg00450.html

Regards,
Abid

How about:

2     for (int i=0; i<100; i++)
3 ->      printf ("i = %i\n", i); //
4     printf ("this won't be executed after line 3 except for the last time\n");

If you set a breakpoint on line 4 after line 3, then you will fail to
return to line 3 when single stepping.

How about:

2 -> goto carp;
3 puts("won't ever be executed");
4 carp:
5 puts("will get executed");

If you set a breakpoint at line 3 you won't stop.

Another:

2 -> throw foo();
3 puts("this will never get hit");

If you set a breakpoint at line 3 you will never hit it.

Please trust that we know what we are doing when it comes to single
stepping. I am glad you are thinking about how things are done, but
just be sure to think about the problem in a wider scope than "the code
I am thinking about is linear" and think about all sorts of single
stepping and what you would expect to happen.

GDB also has range stepping thing now.
Yao Qi - [PATCH 0/7] Range stepping

That's not the same thing. That change is to have gdb send the step range to gdbserver and then gdbserver does the single stepping till it is outside the range. That is great when you are doing remote debugging since it reduces the number of packets you have to send and receive. But somebody is still single stepping.

In lldb, we don't single-step from branch to branch, we set a breakpoint on the next branch within the range and continue. The only times lldb single-steps are to step over breakpoints, and branches. I'm sure gdb could easily do this trick as well, though I doubt gdbserver could. gdb does know about what instructions do (e.g. for ARM chips without hardware single-step you have to emulate every instruction so you can set breakpoints on the destination and run there.) But at least last time I looked gdbserver was much lighter-weight than this.
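The branch-to-branch scheme described above can be sketched as a single decision function. This is a simplified model, not the actual ThreadPlan code: `Insn`, `StepDecision`, and `NextStepAction` are hypothetical, the `can_branch` flag stands in for the disassembler's annotation, and running to `range_end` when no branch remains is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded instruction with the disassembler's annotation.
struct Insn {
    std::uint64_t addr;
    bool can_branch;
};

enum class StepAction { SingleStep, RunToBreakpoint, LeftRange };

struct StepDecision {
    StepAction action;
    std::uint64_t breakpoint_addr;  // meaningful only for RunToBreakpoint
};

// Decide how to advance from pc while stepping through one source range.
// range: the instructions of the range, sorted by address.
// range_end: address of the first instruction past the range.
StepDecision NextStepAction(const std::vector<Insn>& range,
                            std::uint64_t range_end, std::uint64_t pc) {
    for (std::size_t i = 0; i < range.size(); ++i) {
        if (range[i].addr != pc)
            continue;
        // On a possible branch: single-step it, then see where we landed.
        if (range[i].can_branch)
            return {StepAction::SingleStep, 0};
        // Otherwise run at full speed to the next possible branch.
        for (std::size_t j = i + 1; j < range.size(); ++j) {
            if (range[j].can_branch)
                return {StepAction::RunToBreakpoint, range[j].addr};
        }
        // No branch left, so nothing can divert us before the range ends.
        return {StepAction::RunToBreakpoint, range_end};
    }
    // pc is not in the range: the stepper must decide what to do next.
    return {StepAction::LeftRange, 0};
}
```

For a 20-instruction line containing one call, this yields about three stops (run to the call, step it, run to the end) instead of twenty single steps.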

Jim

Jim points out that this is a different approach than lldb took -- it's pushing some limited amount of single instruction stepping down into the remote stub.

The cost of single instruction stepping can be broken down into (1) time to stop the inferior process, (2) time to communicate inferior state between stub and debugger, and (3) time for the debugger to decide whether to resume the process or not.

The gdb approach reduces 2 & 3. lldb's approach is addressing all of 1-3. A single source line may have many function calls embedded within it -- printf("%d\n", f(g(x))); -- so lldb will still need to stop the inferior 4 more times than gdb for this sequence (stop at the point of the call instruction, then single instruction step into the call -- whereas with gdb's approach the stub will single instruction step into the call and then report back to gdb).
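The three cost components can be put into a back-of-the-envelope model. The functions below are a hypothetical simplification: every stop is assumed to cost the same, and the unit (microseconds) is arbitrary. Real costs depend heavily on the transport and target.

```cpp
#include <cassert>
#include <cstdint>

// Per-stop costs, matching components (1), (2) and (3) above.
struct StopCost {
    std::uint64_t stop_us;    // (1) stopping the inferior
    std::uint64_t comm_us;    // (2) stub <-> debugger communication
    std::uint64_t decide_us;  // (3) debugger deciding whether to resume
};

// Debugger single-steps every instruction itself: all three costs per step.
std::uint64_t HostSingleStepCost(const StopCost& c, std::uint64_t insns) {
    return insns * (c.stop_us + c.comm_us + c.decide_us);
}

// Stub steps the whole range: the inferior still stops per instruction,
// but communication and the debugger's decision happen once.
std::uint64_t StubRangeStepCost(const StopCost& c, std::uint64_t insns) {
    return insns * c.stop_us + c.comm_us + c.decide_us;
}

// Branch-to-branch: all three costs, but only once per actual stop.
std::uint64_t BranchToBranchCost(const StopCost& c, std::uint64_t stops) {
    return stops * (c.stop_us + c.comm_us + c.decide_us);
}
```

Which strategy comes out ahead in this model depends entirely on the numbers plugged in: range stepping in the stub wins when stops are cheap and packets are expensive, while branch-to-branch wins when a range has many instructions and few branches.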

In lldb we've put a lot of time in optimizing #2. Besides getting rid of the "acks" in the gdb-remote protocol by default (needed for an unreliable transport medium, like a raw serial connection to a target board), we looked at what pieces of information lldb needs to decide whether to keep stepping or stop. It needs to know the stop reason, it needs to know the pc, it needs the stack pointer, and it probably needs the frame pointer. So in the "T" packet which the stub sends to indicate that the inferior has stopped, we have a list of "expedited registers" - register values that the stub provides without being asked.

The result is that every time lldb needs to step a single instruction within a function bounds, there are two packets sent: The "T05" packet indicating the inferior stopped, and lldb sending back another "vCont;s" packet saying to instruction step again, if appropriate. The overhead of #2 has been dramatically reduced by this approach. (think about a scenario where there are no expedited registers in the T packet - the debugger is going to need to ask for each of these registers individually, or get all registers via the g packet, and it's going to be really slow.)
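As an illustration of what the debugger gets "for free" in such a packet, here is a minimal parser for the general `T` stop-reply shape from the gdb-remote protocol: `T`, a two-digit hex signal number, then `key:value;` pairs where a key is a hex register number or a name such as `thread`. The `StopReply` type and `ParseStopReply` function are hypothetical, and the sketch ignores packet framing and checksums.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical parsed form of a "T" stop-reply payload.
struct StopReply {
    unsigned long signal;
    std::map<std::string, std::string> fields;  // key -> raw hex text
};

// Parses e.g. "T05thread:1f03;10:0010004000000000;".
// Values are kept as raw text; register bytes are in target byte order.
bool ParseStopReply(const std::string& pkt, StopReply& out) {
    if (pkt.size() < 3 || pkt[0] != 'T')
        return false;
    if (!std::isxdigit(static_cast<unsigned char>(pkt[1])) ||
        !std::isxdigit(static_cast<unsigned char>(pkt[2])))
        return false;
    out.signal = std::stoul(pkt.substr(1, 2), nullptr, 16);
    out.fields.clear();

    std::size_t pos = 3;
    while (pos < pkt.size()) {
        std::size_t semi = pkt.find(';', pos);
        if (semi == std::string::npos)
            semi = pkt.size();
        const std::string pair = pkt.substr(pos, semi - pos);
        const std::size_t colon = pair.find(':');
        if (colon != std::string::npos)
            out.fields[pair.substr(0, colon)] = pair.substr(colon + 1);
        pos = semi + 1;
    }
    return true;
}
```

With the pc, sp, and frame pointer already present in `fields`, the stepping logic can decide whether to resume without sending any register-read packets. The register numbers in the example comment are made up for illustration.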

The approach Jim did with lldb does assume that you have a disassembler with annotations regarding whether an instruction can affect flow control - branches, calls, jumps, etc. The llvm disassembler includes these annotations. Last time I looked at the disassembler gdb is using, it doesn't include this kind of information about instructions.

J