Running other threads while stepping

I'm making good progress with the FreeBSD target thread support, but
see some differences in behaviour between my work and the Linux
support. (Thanks to Mike Sartain for sending me debug logs of the
Linux case.)

I'm using a version of Mike's pthread sample code that I've modified
for portability; it's available at
https://raw.github.com/emaste/snippets/master/mikesart_pthread.c if
anyone's interested. The test code spawns a couple of threads and
sleep()s in those threads. In my test I set breakpoints on both
sleep() calls and run the process. After hitting a breakpoint I
'step-in' or 'step-over' the sleep(), and I end up stopping on a
breakpoint in another thread, rather than moving to the next source
line. Mike's log on Linux doesn't show this; the process carries on
with the next line.

SVN r166732 has the commit message "Found a couple more places where
we need to run all threads when stepping," and as described it does
introduce cases where all threads are run for a step-in or step-over.
For whatever reason this affects the ThreadPlans I encounter on
FreeBSD, but not Linux.

I'm curious about the reason this change was made, and whether the
Linux behaviour or mine is expected. I can see arguments both ways,
but would expect the behaviour to be consistent on both platforms.

When you say that you end up stopping on a breakpoint in another thread, do you mean that you stop at the user breakpoint that you set on the sleep call or that the other thread stops on the internal breakpoint associated with the step command?

If the other thread is hitting the user breakpoint, that sounds reasonable and it's probably just a matter of test timing whether or not it happens. That is, you might hit the breakpoint in both threads at once and have it behave like a single stop (except that both thread stops should be reported), or you might hit just one and then hit the other during stepping. In the log (assuming that Linux process logging is enabled), the case of both breakpoints being hit at once would show up as a breakpoint event being processed while we were trying to stop all threads.

If, on the other hand, the internal breakpoint for the step is being hit by a thread other than the thread that's stepping, that's a problem. If that happens the thread plan should look at the breakpoint, see that it's an internal breakpoint associated with a particular thread, and continue silently if the thread that hit the breakpoint is not the current thread. If that were happening, you would see it in the log file but it would otherwise be invisible to the user.

Incidentally, one of the changes in r166732 looks wrong. The first change in ThreadPlanStepInRange.cpp was this:

+ if (m_stop_others == lldb::eOnlyThisThread)
             stop_others = false;
+ else
+ stop_others = true;

Unless I'm reading this wrong that will cause the thread plan to run other threads if the selected mode has requested that only the current thread be run during stepping but will stop other threads in other case, which seems exactly backward from what is intended.

-Andy

When you say that you end up stopping on a breakpoint in another thread, do you mean that you stop at the user breakpoint that you set on the sleep call or that the other thread stops on the internal breakpoint associated with the step command?

It's the former - when I step I arrive at the sleep(1) in the main thread.

If the other thread is hitting the user breakpoint, that sounds reasonable and it's probably just a matter of test timing whether or not it happens. That is, you might hit the breakpoint in both threads at once and have it behave like a single stop (except that both thread stops should be reported), or you might hit just one and then hit the other during stepping. In the log (assuming that Linux process logging is enabled), the case of both breakpoints being hit at once would show up as a breakpoint event being processed while we were trying to stop all threads.

In the FreeBSD case this race doesn't exist -- when the first thread
hits the breakpoint the entire process is stopped, and there's no way
to return another stop event until after the process is restarted.
Looking at the log confirms - both threads (the 3rd hadn't been
started yet) are set to running for the step.

Incidentally, one of the changes in r166732 looks wrong. The first change in ThreadPlanStepInRange.cpp was this:

+ if (m_stop_others == lldb::eOnlyThisThread)
             stop_others = false;
+ else
+ stop_others = true;

Yes, this looks backwards. I did some further digging into one case
of my issue and it comes from the equivalent test in
ThreadPlanStepOverRange.cpp:

    if (m_stop_others == lldb::eOnlyThisThread)
        stop_others = true;
    else
        stop_others = false;

which at least has the expected sense.

In my case m_stop_others is eOnlyDuringStepping. With the above test
stop_others is false and so ThreadPlanStepOverRange::ShouldStop calls
m_thread.QueueThreadPlanForStepOut and queues a thread plan that runs
the other threads.

Presumably I can explicitly suspend all threads but one before
stepping. From a user's perspective though it currently seems
arbitrary whether or not other threads will run during a step.

Incidentally I found what appears to be another bug here (in
ThreadPlanStepOverRange::ShouldStop). older_ctx_is_equivalent is
initialized to true, then set to true if certain conditions are met.
I presume this change is suitable:

--- a/source/Target/ThreadPlanStepOverRange.cpp
+++ b/source/Target/ThreadPlanStepOverRange.cpp
@@ -121,7 +121,7 @@ ThreadPlanStepOverRange::ShouldStop (Event *event_ptr)
             // in so I left out the target check. And sometimes the
module comes in as the .o file from the
             // inlined range, so I left that out too...

- bool older_ctx_is_equivalent = true;
+ bool older_ctx_is_equivalent = false;
             if (m_addr_context.comp_unit)
             {
                 if (m_addr_context.comp_unit == older_context.comp_unit)

When you say that you end up stopping on a breakpoint in another thread, do you mean that you stop at the user breakpoint that you set on the sleep call or that the other thread stops on the internal breakpoint associated with the step command?

It's the former - when I step I arrive at the sleep(1) in the main thread.

If the other thread is hitting the user breakpoint, that sounds reasonable and it's probably just a matter of test timing whether or not it happens. That is, you might hit the breakpoint in both threads at once and have it behave like a single stop (except that both thread stops should be reported), or you might hit just one and then hit the other during stepping. In the log (assuming that Linux process logging is enabled), the case of both breakpoints being hit at once would show up as a breakpoint event being processed while we were trying to stop all threads.

In the FreeBSD case this race doesn't exist -- when the first thread
hits the breakpoint the entire process is stopped, and there's no way
to return another stop event until after the process is restarted.
Looking at the log confirms - both threads (the 3rd hadn't been
started yet) are set to running for the step.

Incidentally, one of the changes in r166732 looks wrong. The first change in ThreadPlanStepInRange.cpp was this:

+ if (m_stop_others == lldb::eOnlyThisThread)
            stop_others = false;
+ else
+ stop_others = true;

Yes, this looks backwards. I did some further digging into one case
of my issue and it comes from the equivalent test in
ThreadPlanStepOverRange.cpp:

   if (m_stop_others == lldb::eOnlyThisThread)
       stop_others = true;
   else
       stop_others = false;

which at least has the expected sense.

In my case m_stop_others is eOnlyDuringStepping. With the above test
stop_others is false and so ThreadPlanStepOverRange::ShouldStop calls
m_thread.QueueThreadPlanForStepOut and queues a thread plan that runs
the other threads.

Presumably I can explicitly suspend all threads but one before
stepping. From a user's perspective though it currently seems
arbitrary whether or not other threads will run during a step.

Whether to run one or all threads is a balancing act between the desire on the one hand to have "step" and "next" not cause other threads to run, and on the other hand to not cause the process you're debugging to deadlock and stall because the thread you ARE stepping or nexting tried to acquire a lock held by another thread.

lldb's stepping logic makes the assumption that single stepping over instructions in the frame the user is currently moving through can be safely done when running only one thread, but anything that might cause arbitrary code to run is NOT safe to do with just one thread running. The rational behind this is that you probably aren't ever going to try to step over the instruction that actually tries to acquire a lock, but you very well might step over a line of code that contains a function call that tries to acquire a lock. In practice, this works pretty well.

This isn't academic. Back when I wasn't letting all threads run on the "step-out" that happened while doing "step-over" I got a bunch of reports of "step-over" leading to deadlocks in the application being debugged. I haven't had any since making this work the way it is now. It might seem a little arbitrary to users whether other threads get to run or not, but it is better than having the user have to manage cases where their process should have made some progress during this step but is isn't, say by interrupting it, and continuing the logical step operation you were in the middle of, but this time with more threads allowed to run.

The other option would be to do some timing kind of heuristic, where we let only one thread run with a timeout, and if it doesn't return after that timeout we interrupt it resume it with all threads running. That might be worth trying, but you'd have to put in place a mechanism for ThreadPlans to register a timeout and a callback on timeout that could adjust and restart the process, and that isn't in place yet. Not sure this would appear any more arbitrary than what we do now, however, though it might allow more code to run on only a single thread.

So, whenever you are single stepping from instruction to instruction in the frame you started in, we consider it safe to only run one thread. But if you have to do a "step-out" you are allowing arbitrary code to run, and that is unsafe without allowing all threads to run. So in that case, we allow all threads to run. Note, the "stop_others" is only used if we end up pushing another plan, generally a step-out or step-through plan. We use the value in m_stop_others in the step thread plan itself if we decide we've got more stepping in this range to do.

That's the way it is supposed to work... It does look like ThreadPlanStepInRange test is some kind of thinko. In practice, in this case it is probably safe to run only one thread when doing the "step through" since that generally involved running from a shared library stub to its target. That could deadlock if the dynamic loader has to fix up the symbol, and another thread is in the middle of doing that. But that doesn't seem to be very common, or at least the code is clearly wrong but I haven't had any reports of this causing deadlocks...

The StepOverRange one is correct. The default for stepping is eOnlyDuringStepping, by which I really mean "only when stepping directly through the code in the current frame" and is a best effort kind of thing. Changing the run mode to eOnlyThisThread should only be done if you KNOW there is no possibility of deadlocks, or are willing to handle working out of the deadlocks yourself. For the purposes of testing, you could switch the mode to eOnlyThisThread in your steps if you want to force the order in which things happen. But that shouldn't be the default.

Incidentally I found what appears to be another bug here (in
ThreadPlanStepOverRange::ShouldStop). older_ctx_is_equivalent is
initialized to true, then set to true if certain conditions are met.
I presume this change is suitable:

--- a/source/Target/ThreadPlanStepOverRange.cpp
+++ b/source/Target/ThreadPlanStepOverRange.cpp
@@ -121,7 +121,7 @@ ThreadPlanStepOverRange::ShouldStop (Event *event_ptr)
            // in so I left out the target check. And sometimes the
module comes in as the .o file from the
            // inlined range, so I left that out too...

- bool older_ctx_is_equivalent = true;
+ bool older_ctx_is_equivalent = false;
            if (m_addr_context.comp_unit)
            {
                if (m_addr_context.comp_unit == older_context.comp_unit)

Yes, that seems wrong. This code is there as a backstop for when the unwinder drops a frame at the beginning of new function/trampoline or whatever. In that case, (older_ctx_is_equivalent == false) we will see if we are at a trampoline function that somebody knows how to get out of, and otherwise we will stop.

My guess is that doesn't happen very often anymore, so the sanity check is always succeeding anyway. Your suggestion seems fine.

Jim

Jim - thanks for this comprehensive and detailed post.

I'm not sure about emitting a message, people generally don't like being told something they don't really understand and can't do anything about. And adding output to the console window should be done judiciously anyway, since that makes it harder to read the output you really care about.

I do want to add some command that will inspect the current thread plans, so you could see how they are stacked and cancel ones you don't care about anymore. For instance, if you do:

step over -> hit a breakpoint below on the stack
step over -> hit another breakpoint

etc. Then if you continue, you'll finish the latest step over (unless you hit another breakpoint in the process...) Then continue again will complete the second step over. This is often quite convenient, but sometimes you just want to forget about the step-over's and continue running. This would provide a way to do that. Perhaps that would be a good place to insert some info about how the plans will run the threads.

Jim

That would be perfect. I ended up adding a bunch of log Printfs to
the thread plan functions in my tree in order to understand what was
happening, including the stop_others value when pushing / queuing a
sub plan. It might be handy to have the capability of querying the
thread plans for a step / next without actually executing it as well.

Yes, that would be interesting as well. This is not something I'm likely to have time to do in the near term, but if anybody else wants to take it on I'd be happy to help...

Jim

It looks like this was never changed -- shall I commit the inverted case?