Continuing from a breakpoint with multiple threads

[I’m trying to make TestBreakAfterJoin work on Windows.]

I’m unclear how continuing from a breakpoint in a multi-threaded inferior is supposed to work.

A breakpoint is set, and the inferior runs until one of its threads hits the breakpoint. The user then selects continue.

The thread that had hit the breakpoint has a thread plan type of ThreadPlanStepOverBreakpoint, which causes all of the other threads to be set to state eStateSuspended. The thread that had hit the breakpoint then steps beyond the breakpoint, and the breakpoint is restored. The thread is then resumed again.

But the other threads are all still suspended, causing the inferior to deadlock.

The question is: Where should the other threads have their resume states set back to a running state?

Adrian

When stepping over a breakpoint, the ThreadPlanStepOverBreakpoint gets pushed, handles the single instruction step - during which it does suspend the other threads - then it gets popped. When you next resume, whatever plan was handling the stepping before the breakpoint was hit will resume with whatever policy for running other threads it was using.

So it's up to the plan that was on the stack before the "StepOverBreakpoint" was pushed to decide this. Most of the more complex plans (like step-over/step-into) try to keep the other threads from running if possible (unless the user instructed otherwise) but they also will let all the threads run if there's something going on that might deadlock. For instance, if you are doing "next" and we step through straight-line instructions in a function, we will only run the one thread you are stepping in (by default, you can control this with options to the "thread step-over" command. But if we step into a function, we set a breakpoint on the return address and then run with all threads resumed because stepping out of a function could run arbitrary code.

Anyway, was this a theoretical question, or do you have some instance where you are actually seeing a deadlock?

Jim

Thanks for the info.

This is not theoretical. I’m trying to get TestBreakAfterJoin to pass on Windows. Step 1 was to convert it use instead of <pthreads.h>. Step 2 was to fix some minor issues in TargetThreadWindows.

But now the inferior deadlocks because the one thread that’s not suspended is waiting on the ones that are. Once the ThreadPlanStepOverBreakpoint plan is popped, the current plan is ThreadPlanBase, which, as far as I can tell, does nothing to resume suspended threads.

I’ll compare this to what happens on another platform to see if there should be some other thread plan in use.

Adrian.

The way this works is that when we go to resume the process, all the thread's get asked whether they need to stop other threads to implement whatever strategy they are currently pursuing. That query ends up calling the currently active thread plan's "StopOthers" method. If one thread returns true to StopOthers, then that thread will get to run solo. If more than one thread returns true I do a little round robin to pick which one gets to go. But ThreadPlanBase::ShouldStop returns false, so if all your threads are running just the ThreadPlanBase, then they should all resume.

This all happens in ThreadList::WillResume.

Note I started to add some commands to manipulate the thread plans - of which "thread plan list" is the relevant one. The work isn't done yet (for instance I should actually DO something with the --internal and --verbose options, but for now I only print user visible plans, not implementation only plans. Anyway, if you are poking around in this area that might be of some use.

Jim

The way this works is that when we go to resume the process, all the thread’s get asked whether they need to stop other threads to implement whatever strategy they are currently pursuing. That query ends up calling the currently active thread plan’s “StopOthers” method.

Right, and since the ThreadPlanStepOverBreakpoint responds true to StopOthers, all the other threads get suspended. Once the breakpoint is restored and stepped over and the ThreadPlanStepOverBreakpoint is popped, the rest of the threads are still suspended.

But ThreadPlanBase::ShouldStop returns false, so if all your threads are running just the ThreadPlanBase, then they should all resume.

Except that ShouldStop is not called for threads that are already suspended (Thread::ShouldStop has an early out if the resume state is eStateSuspended). So I still don’t see how those threads can ever get out of the suspended state.

Currently ThreadWindows::WillResume() looks like this:

void
TargetThreadWindows::WillResume(lldb::StateType resume_state)
{
SetResumeState(resume_state);
}

I originally put this code in because that’s what one or two of the other plugins did and I wasn’t sure what the “correct” thing to do was. I’m not sure if it’s correct though, or if it could be a cause for the bug. But if the resume state is eStateSuspended as you say, then that suggests that something lower level already decided that this thread should continue to be suspended after the user continues. So the bug might actually still be earlier.

There’s not a lot of logging in ThreadList::WillResume, I wonder if it would be worth adding some?

I think Zach’s right. The only plugin that calls SetResumeState inside WillResume is POSIXThread, and that seems to be overridden by FreeBSDThread.

If I remove the call of SetResumeState from TargetThreadWindows::WillResume, everything starts to work.

I’ll look into adding some logging in ThreadList::WillResume.

Thanks everyone.

I’m not quite ready to throw the blanket over this one yet :slight_smile:

What was the value of resume_state when it called WillResume()? It sounds like it was eStateSuspended, which if that’s the case, then it still seems like something deeper inside of LLDB’s thread plans is confused about something, because calling WillResume(eStateSuspended) means “The process is seriously about to resume, and when it does, this thread is not going to remain suspended”.

But it’s possible I’m not understanding something about the interaction between resume states and the thread plans.

Do you have any insight here Jim?

s/is not going to/is going to/

ThreadList::WillResume, near the bottom, ends up calling ShouldResume(eStateSuspended) on the other threads, because the ThreadPlanStepOverBreakpoint::StopOthers says it should.

Thread::ShouldResume effectively passes its parameter on to the thread’s WillResume, with the comment, “Let Thread subclasses do any special work they need to prior to resuming.”

Ahh, I find it a little confusing that ShouldResume() calls WillResume(). ShouldResume() to me sounds like a function that should be const. It seems like it should just ask a question and gets a yes/no answer, but not modify anything.

But looking over the code, it looks like WillResume() actually means “I don’t know if you need to resume or not, but if you do, then do it, and if you don’t, then tell me that by returning false”

I agree that the naming of these functions are suboptimal indicators of what they do.

One thing to keep in mind is that there are two kinds of "resume state". There's the user-specified resume state, which is set by Thread::SetResumeState, and is stored in m_resume_state in the thread. It should start out "running" and never change unless some user command does so.

I don't know why you are ever calling "SetResumeState" on Windows threads, that's not something you should have to do. Who is changing this value? ProcessGDBRemote calls this in one place, but that is when we are restarting a process to kill it, and we're just trying to keep all the other threads from doing something annoying like crashing while we're trying to kill it. In general, though, internal to lldb you should never need to change this.

The more relevant state is the current thread resume state, which records what the thread is going to do on this particular resume. This state is stored in m_temporary_resume_state. That will control the state of this thread on resume, but note that should only ever get consulted on the "should stop" side of the thread plan logic. And of course if m_resume_state is eStateSuspended, that should short-circuit any of our normal logic about this thread.

There is another tricky bit about the whole resume negotiation, which is that you may not need to resume the target at all to perform the continue action. For instance, if you have nested inline functions which begin on the same address, then "step in" on that thread just moves the virtual "inline depth" to the next younger stack frame, but doesn't move the PC at all. So ShouldResume can return false.

So the first step is to take out the call to SetResumeState, that's not something you should ever have to do. Then if that causes problems, we should figure out what's going wrong on Windows.

Jim

One thing to keep in mind is that there are two kinds of "resume state". There's the user-specified resume state, which is set by Thread::SetResumeState, and is stored in m_resume_state in the thread. It should start out "running" and never change unless some user command does so.

I don't know why you are ever calling "SetResumeState" on Windows threads, that's not something you should have to do. Who is changing this value? ProcessGDBRemote calls this in one place, but that is when we are restarting a process to kill it, and we're just trying to keep all the other threads from doing something annoying like crashing while we're trying to kill it. In general, though, internal to lldb you should never need to change this.

The more relevant state is the current thread resume state, which records what the thread is going to do on this particular resume. This state is stored in m_temporary_resume_state. That will control the state of this thread on resume, but note that should only ever get consulted on the "should stop" side of the thread plan logic.

This comment wasn't very clear. What I meant was that it is legit to ask "did this thread do anything interesting on last resume" when you stop, for instance to decide whether you need to re-fetch its stop reason, etc. But you shouldn't decide whether the thread is going to run or not on the NEXT resume based on what it did in the previous resume.

One other bit, just in case you're looking at this more closely... The other place where we call SetResumeState in lldb is in "process continue" and "thread continue". "process continue" sets all the states to running. The idea is that "continue" should get all the threads that aren't user-suspended running, but I don't think the code there actually does anything useful. I'll play around with taking that out.

The other place is in "thread continue", where it sets all the specified threads to running, and all the others to suspended. That's appropriate since thread continue is really a composite command that suspends all the unspecified threads, resumes all the specified ones, and continues the process. So we're posing as the user here.

Jim

Thanks! This has been very helpful. For reference, the reason why that SetResumeState() was ever there to begin with is because when I was first trying to make all this work I used the ProcessPOSIX as a model, and POSIXThread does the same thing. There’s even a comment saying that it doesn’t seem right, so I guess I should have looked more closely. But everything was still new to me at the time, so I wasn’t sure what the implications of fiddling with it were.

Thanks! This has been very helpful. For reference, the reason why that SetResumeState() was ever there to begin with is because when I was first trying to make all this work I used the ProcessPOSIX as a model, and POSIXThread does the same thing. There's even a comment saying that it doesn't seem right, so I guess I should have looked more closely. But everything was still new to me at the time, so I wasn't sure what the implications of fiddling with it were.

Yeah, I should have audited the Posix code as it was coming in more closely but I didn't have time to figure out the awful way the debugger has to interface with threads on Linux...

Anyway, try taking that out and see if things go better. It would also be interesting to see what goes wrong if you take that out on the POSIXThread side.

Jim

These calls in POSIXThread are only used for linux local-only
debugging, which noone tests right now, as the remote path is quite
stable. We would like to yank this code out completely in due time.

cheers,
pl

Thanks, all, for all the information and advice.

We’ve got it working in Windows now by removing the SetResumeState call from WillResume in the Windows-specific Thread implementation.

POSIX/Linux isn’t in my wheelhouse, so I’ll leave that for others.