Improve single thread stepping

We have several internal Linux targets that have a non-trivial number of threads (3,000+). I acknowledge that this is not ideal, and we’ve provided feedback to the target owners to reduce the thread count.

The high number of threads has resulted in significant slowness when using LLDB for stepping. In some cases, it takes around 10 seconds to step through certain non-trivial statements. We’ve observed that switching to single-thread stepping significantly improves performance, reducing the time to around 3 seconds. However, we are aware that single-thread stepping can be risky and may lead to deadlocks.

To address this issue, I’d like to explore a couple of options:

  1. @clayborg mentioned that there used to be a mode or logic to specify a timeout for single-thread stepping, after which the thread plan would resume other threads, potentially avoiding deadlocks. Has this logic been reverted?
  2. Is it feasible to introduce some form of recovery logic? For instance, when a deadlock occurs, users typically pause asynchronously in LLDB and then resume. Currently, this doesn’t resolve deadlocks. However, we could modify LLDB to resume all threads to address the deadlock situation.
  3. This could potentially be an RFC. The existing while-stepping ThreadPlan prevents other threads from executing unless there are call sites within the step range. This approach is quite conservative. On Linux, would it be possible to improve the stepping mode by using ptrace() to detect syscalls from the inferior, and only resume other threads at that point? This assumes that only syscalls can hold locks, which may not hold true for user-mode spin locks. We’re unsure of the complexity involved in adding special detection for user-mode spin locks. If this approach is viable, it could allow us to remain in single-thread stepping mode for most call sites.

cc @jingham, @labath

I have a couple of responses.

First:

What part(s) of stepping in this situation are taking the majority of that time? It’s hard to reason about solutions without knowing what the actual problem is.

I’m confused in particular about why only allowing one thread to run speeds things up. The only difference from lldb’s perspective is that in the run one thread case it tells the system to suspend all the other threads, rather than telling it not to. If anything, I would have guessed that having to suspend all the other threads would be more time-consuming, since you generally don’t have to do anything to have the thread continue (making all-thread stepping faster).

So to give a real answer, I’d need some finer-grained data.

Second:

To answer your individual points:

  1. Greg is almost right. There’s a routine in Process called “RunThreadPlan” which is meant to handle running a thread plan a bit on one thread, then falling back to running all threads. When I designed that originally it was meant to be general, but the first use it was put to was for running the expression evaluation thread plans, and since we didn’t have a use for it other than that, over time it’s gotten too specific to that purpose. But you certainly could re-extend that machinery to handle general thread plans.

  2. IIUC, getting RunThreadPlan to handle the stepping plans as well as the expression running plans would give you your “Recovery mechanism”. Note, some thread plan uses - e.g. when using the step-instruction plan to step over breakpoints - have to run only one thread, so you’d have to put in some force-one-thread mode to control for that.

  3. If ptrace will stop you synchronously when a syscall occurs, then treating that event as a “run all threads” signal should be fairly straightforward. The current “Explained Stop” in the stepping thread plans for this is “stopping at a branch instruction”. Adding “got the ptrace syscall notification” as another “Explained Stop” in the stepping plans should be straightforward. The hardest part is probably structuring this in a pluggable system so you can support other systems with different such decision points.
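
For reference, here is a minimal standalone sketch of the Linux mechanism in question (plain ptrace code, not LLDB code): with PTRACE_SYSCALL plus the PTRACE_O_TRACESYSGOOD option, the tracee stops synchronously at the next syscall boundary, and that stop is distinguishable from an ordinary trap, so it could be surfaced as the “resume all threads” trigger.

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <csignal>

// Resume one traced thread and wait until it reaches a syscall entry/exit.
void run_until_syscall(pid_t tid) {
  // Make syscall-stops report SIGTRAP | 0x80 so they are unambiguous.
  ptrace(PTRACE_SETOPTIONS, tid, nullptr, (void *)PTRACE_O_TRACESYSGOOD);
  ptrace(PTRACE_SYSCALL, tid, nullptr, nullptr);

  int status = 0;
  waitpid(tid, &status, __WALL);
  if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
    // The thread is crossing a syscall boundary: this is the point where
    // the stepping machinery could decide to resume the other threads.
  }
}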

As for “I can detect all potential deadlocks in any random code that lldb might get shown” - I’m a little skeptical, but even so, a strategy that catches the common cases, coupled with a fairly long “run on a single thread” timeout from doing #1 might be close enough in effect, even if you can’t catch all the cases that could deadlock a thread.

So the strategies you mention, RunThreadPlan for stepping, with a reasonably long “run on one thread” timeout, and some help for known synchronization points, should not be particularly hard to implement, or at all out of place with the way execution control currently works.

Third:

Another idea Pavel and I discussed a couple years back for situations with lots of threads is that until lldb decides to return control to the user, it really doesn’t need to know about threads that don’t “stop for a reason” (e.g. breakpoint, signal, single-step completion). At any given stop, most threads were just running along when one thread stopped the process, and those threads don’t contribute to any of the “should stop” negotiation we do on stop.

lldb already internally supports a debug stub that only tells it about threads that have stopped “for a reason” at any given stop, so this wouldn’t require much work in lldb to make that happen. You would need to change the stub so it can tell when to return “threads stopped with a reason” and “all threads”. One simple way to do that is to have an extended stop reply packet that fills in only the threads stopped “for a reason” and then lldb can get all threads when it needs to with jThreadsInfo.
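
For illustration only (the packet contents here are invented), the exchange might look like this: the stop reply names just the one thread that stopped with a reason, and the client fetches the complete list only when it actually needs it:

<- $T05thread:a1b2;threads:a1b2;reason:breakpoint;#xx
-> $jThreadsInfo#c1
<- [ full stop info for every thread, fetched only on demand ]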

Anyway, so to make this work we’d just have to teach the stub (lldb-server) to return only the threads “stopped for a reason” in the stop packet, and set lldb in the forgetful stub mode. If you want to play with this mode, it’s currently set by the overly specific:

target.process.experimental.os-plugin-reports-all-threads

setting, but it’s actually a general feature in lldb. It’s named that because to date OS Plugins were the only ones that actually did this.

For the most part, users want to know that all their threads are persisting, so I don’t think it’s a great user experience if we remain ignorant of “uninteresting” threads on user stop. Plus, getting all the threads every so often really helps our garbage collection. But most user-level stepping actions involve a bunch of private stops, and for those we definitely don’t need any threads that haven’t stopped “for a reason”.

That strategy would seem promising, since it greatly reduces the data and processing on all these private stops. But OTOH, your saying that “stepping only a single thread is much faster” makes it sound like this isn’t the primary issue, since we’re told about all threads regardless of whether we’re going to let them run or suspend them.

So again, before going much further with this discussion I’d want to know what is slow here in more detail.


Thank you for the insights. This is incredibly helpful!

This is my first time delving into the ThreadPlan logic, and I’ve only done a brief read on single-thread stepping. Regarding the performance bottleneck, the biggest challenge lies in determining whether it’s latency-bound or CPU-bound. Based on my investigation of a step operation taking 8-10 seconds, it appears that there are 8 internal/private stops, 3 of which involve stopping other threads, while the other 5 require resuming all threads.

I’ve profiled LLDB during the stepping process using the Linux perf tool, which attributes around 75% of CPU time to various bottlenecks. The most critical path appears to be in ProcessGDBRemote::GetThreadStopInfoFromJSON(), called from ProcessGDBRemote::CalculateThreadStopInfo(). It seems that we enumerate through each thread and attempt to find its stop information by TID in m_jstopinfo_sp. When stopping all threads, the m_jstopinfo_sp JSON can become quite large, making the overall lookup potentially O(N^2). There’s a similar issue in ThreadList::GetBackingThread(). I’ve made a quick prototype using a hash table mapping <TID => stop_info>, and it seems to improve stepping performance by around 10-20%, although that’s not as significant as single-thread stepping. I’ve also noticed various JSON parsing operations in ProcessGDBRemote::SetThreadStopInfo(). It seems that the size of jstopinfo is much larger when stopping all threads.
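
For concreteness, this is roughly the shape of the prototype’s caching (heavily simplified, with assumed helper names): build the TID index once per stop, so each per-thread lookup afterwards is O(1) instead of a linear scan of the JSON array:

// Sketch only: index the jstopinfo array by TID once per stop.
std::unordered_map<lldb::tid_t, StructuredData::Dictionary *> tid_to_stop_info;

void BuildStopInfoIndex(StructuredData::Array &jstopinfo) {
  jstopinfo.ForEach([&](StructuredData::Object *obj) {
    if (StructuredData::Dictionary *dict = obj->GetAsDictionary()) {
      lldb::tid_t tid = LLDB_INVALID_THREAD_ID;
      if (dict->GetValueForKeyAsInteger("tid", tid))
        tid_to_stop_info[tid] = dict; // O(1) lookup per thread afterwards
    }
    return true; // keep iterating
  });
}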

Another possibility for the slowness is latency-bound, as you’ve mentioned. Resuming/pausing 3000+ threads and synchronously waiting for them can indeed take some time.

I’ve also experimented with setting target.process.experimental.os-plugin-reports-all-threads to false, hoping it would reduce jstopinfo, but I haven’t seen any performance improvement with this setting. Currently, it’s not entirely clear to me what is causing the slowness, but if you ask me, it’s most likely the latency issue mentioned above.

The second and third responses are quite intriguing. I might explore some of them if this turns out to be a high-priority issue. I had the same question in my mind while examining the profile trace—whether we really need to update the stop info during private/internal stops. However, I haven’t delved deep enough to propose any solutions. It’s great that you and Pavel have already explored this to some extent. I wonder if, to reduce the size of jstopinfo, setting target.process.experimental.os-plugin-reports-all-threads to false is sufficient or if further optimizations are needed?

Snippet of the profile trace:

|--49.81%--lldb_private::ThreadList::ShouldStop(lldb_private::Event*)
                          |          |
                          |           --49.43%--lldb_private::Thread::GetStopInfo()
                          |                     |
                          |                      --49.35%--lldb_private::Thread::GetPrivateStopInfo(bool)
                          |                                |
                          |                                 --49.30%--lldb_private::process_gdb_remote::ThreadGDBRemote::CalculateStopInfo()
                          |                                           lldb_private::process_gdb_remote::ProcessGDBRemote::CalculateThreadStopInfo(lldb_private::process_gdb_remote::ThreadGDBRemote*)
                          |                                           |
                          |                                            --48.63%--lldb_private::process_gdb_remote::ProcessGDBRemote::GetThreadStopInfoFromJSON(lldb_private::process_gdb_remote::ThreadGDBRemote*, std::shared_ptr<lldb_private::StructuredData::Object> const&)
                          |                                                      |
                          |                                                      |--12.24%--lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(lldb_private::StructuredData::Dictionary*)
                          |                                                      |          |
                          |                                                      |           --12.15%--lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(unsigned long, std::map<unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, unsigned char, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, bool, lldb_private::LazyBool, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, lldb::QueueKind, unsigned long)
                          |                                                      |                     |
                          |                                                      |                     |--5.16%--lldb_private::ThreadList::GetBackingThread(std::shared_ptr<lldb_private::Thread> const&)
                          |                                                      |                     |
                          |                                                      |                     |--2.94%--lldb_private::ThreadList::FindThreadByProtocolID(unsigned long, bool)
                          |                                                      |                     |

|--24.73%--lldb_private::process_gdb_remote::ProcessGDBRemote::RefreshStateAfterStop()
                          |          |
                          |          |--19.68%--lldb_private::Process::UpdateThreadListIfNeeded()
                          |          |          |
                          |          |          |--11.50%--lldb_private::ThreadList::Update(lldb_private::ThreadList&)
                          |          |          |          |
                          |          |          |           --0.89%--lldb_private::Thread::GetBackingThread() const
                          |          |          |
                          |          |          |--3.08%--lldb_private::ThreadPlanStackMap::Update(lldb_private::ThreadList&, bool, bool)
                          |          |          |          |
                          |          |          |           --2.95%--lldb_private::ThreadList::FindThreadByID(unsigned long, bool)
                          |          |          |
                          |          |          |--2.76%--lldb_private::Thread::GetBackingThread() const


I’ve also noticed various JSON parsing operations in ProcessGDBRemote::SetThreadStopInfo(). It seems that the size of jstopinfo is much larger when stopping all threads.

That was supposed to be “stepping all threads” right?

Ingesting the thread list is the main piece of work here, so it’s not surprising that’s what is showing up hot. And that there are some places where we didn’t use the most efficient algorithm because we never had enough threads to notice isn’t terribly surprising. Fixing those is pure goodness!

If your program’s thread population is growing or fluctuating, rather than being at a steady state, then you will indeed see different numbers of threads reported when you step only one of the non-thread-creating threads, as opposed to letting all threads run in the same scenario, because in the latter case the thread-creating threads got a chance to run. And it will also increase the latency by the amount of time those other threads get a chance to run.

But that’s just postponing the inevitable: that work is going to get done at some point, and the threads will eventually get created. Plus, your speedup comes at the expense of altering program behavior, which we try to avoid.

Other than that, I can’t think of any plausible reason why the thread numbers would vary between single thread running and all threads running. At present lldb-server’s contract is that it report all threads to us. It would be surprising if it was not doing that currently, and even weirder if that was gated on whether the last continue ran only one thread or not.

Another possibility for the slowness is latency-bound, as you’ve mentioned. Resuming/pausing 3000+ threads and synchronously waiting for them can indeed take some time.

I’ve also experimented with setting set target.process.experimental.os-plugin-reports-all-threads to false, hoping it would reduce jstopinfo, but I haven’t seen any performance improvement with this setting. Currently, it’s not entirely clear to me what is causing the slowness, but if you ask me, it’s most likely due to the latency issue mentioned above.

That setting just means lldb is READY to handle threads not being reported on all stops. But I very much doubt that lldb-server ever does that, so you wouldn’t expect this to make any difference without a comparable change in lldb-server. If anything, turning the setting on when the stub is always reporting all threads will make things worse, since we can’t retire threads that no longer exist, we instead transfer them to a list of “pended” threads and every time we see a new TID we have to ask if it’s in our list of pended threads first before making a new Thread object to represent it. We will prune the pended threads when you do a full “thread list” but I bet if you have 3000 threads you don’t do that very often.

We could probably make this a little nicer by fetching the full thread list right before we return control to the user. Since most steps through real code have 5-10 internal stops that would still be a decent speedup, w/o causing us to carry around so many pended threads.

The second and third responses are quite intriguing. I might explore some of them if this turns out to be a high-priority issue. I had the same question in my mind while examining the profile trace—whether we really need to update the stop info during private/internal stops. However, I haven’t delved deep enough to propose any solutions. It’s great that you and Pavel have already explored this to some extent. I wonder if, to reduce the size of jstopinfo, setting target.process.experimental.os-plugin-reports-all-threads to false is sufficient or if further optimizations are needed?

All of the logic that we do to prepare for the next time the ThreadPlans continue the process is based on thread stop infos. The majority of those continues are from private stops. So no, we can’t do our job w/o knowing the stop info of all the threads we have to reason about at every stop. Note, if a thread has a trivial stop reason, then we just skip over it in the “what to do next” negotiations. That’s why not telling us about those threads doesn’t change the stepping decisions. But if you do tell us about a thread, we will need to know its stop reason to know what to do with it.

As I said above, the setting requires an equivalent change in lldb-server (to not report threads with no stop reason) for it to make a positive difference.

That bit of the profile isn’t terribly surprising…

Hope this helps,

Jim


Hi @jingham,

I finally have time to take a stab at improving this. I’ve been able to refactor and refine Process::RunThreadPlan() for reuse in single-thread stepping and have got it working in a prototype. (I did have to restore the hijacking and allow some public events to go through so that the UI knows to resume/stop.)

Now, to really implement it, both @clayborg and I agree that we do not need to hijack the public event loop from Process::RunThreadPlan(), but rather incorporate this in the ThreadPlan instead. There are two potential approaches:

  1. Implement as a sub-thread plan: Basically, I can create a new ThreadPlanSingleThreadTimeout that is created on demand inside ThreadPlanStepOverRange. ThreadPlanSingleThreadTimeout is a small state machine that waits for a timeout, then sends a Process::Halt() call and transitions into a second state. When ThreadPlanSingleThreadTimeout receives a stop event during the second state, it will set stop_others to false and return false from its ::ShouldStop() method, telling the caller to resume the process. Since ThreadPlanSingleThreadTimeout has finished its job, it can then pop itself from the plan stack. (A rough skeleton is sketched after this list.) My experiments show I have to be very careful to push ThreadPlanSingleThreadTimeout just after the parent ThreadPlanStepOverRange; otherwise, a ThreadPlanStepOverBreakpoint might end up in the middle, which can accidentally pop both plans when ThreadPlanStepOverBreakpoint finishes its work.
  2. Implement as a utility class for the existing thread plan: Similar to the first approach, this employs the state-machine idea, but as a utility class stored in a member field of ThreadPlanStepOverRange (let’s call it SingleThreadTimeoutHelper) when the run_mode is this-thread-only. ThreadPlanStepOverRange::ShouldStop() would first ask whether SingleThreadTimeoutHelper wants to handle the stop; if handled, it returns false to resume. Otherwise, it falls back to the default processing. This seems to be a safer and easier-to-reason-about approach that won’t mess up the thread plan stack.
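
Here is a rough skeleton of the state machine in approach 1 (simplified from my prototype; names and details are illustrative, not final):

// Illustrative only: waits out a timeout on one thread, then, once the
// halt-induced stop arrives, asks the caller to resume with all threads.
class ThreadPlanSingleThreadTimeout : public ThreadPlan {
  enum class State {
    WaitTimeout,    // single-thread stepping, timer armed
    AsyncInterrupt, // timer fired, halt/interrupt requested
    Done,           // interrupt stop handled, all threads may run
  };
  State m_state = State::WaitTimeout;

public:
  bool ShouldStop(Event *event_ptr) override {
    if (m_state == State::AsyncInterrupt) {
      SetStopOthers(false); // from now on, let all threads run
      m_state = State::Done;
      return false;         // don't stop; the caller resumes the process
    }
    return false;
  }
};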

However, for both approaches, I encountered some issues:

  1. Process::Halt() generates an interrupt stop event which is always treated as a public stop, ignoring the ShouldStop() return value: link to the code. Is there any mechanism to request a private internal interrupt instead of a public interrupt?
  2. To unblock the prototyping, I manually changed the above code to check the ShouldStop() return value and call PrivateResume() anyway. This helped me move forward, but later, when ThreadPlanStepOverRange finishes, its ShouldStop() seems to report the wrong stack depth (link to the code), returning eFrameCompareYounger, which incorrectly caused a StepOut plan to be pushed while the current PC is definitely at the end of the step range. It appears to me that the stack frames are still getting the old PC from the previous stop and not being refreshed. I am not sure whether this weird behavior is caused by my hack in #1, but I just wanted to share.

Questions:

  1. Do you have any suggestions for the Process::Halt() issue?
  2. Which approach (sub-thread plan or utility class) do you prefer above?
  3. If you have thoughts/theories regarding the stale stack frame mismatch issue above, feel free to share as well. I’ve shared the stepping and gdb-remote logs below for this issue, in case they are useful.
Process 758304 stopped
(lldb) 1710292556.594755888 < 820> read packet: $T05thread:pb9220.b9220;name:a;threads:b9220,b93a0;jstopinfo:5b7b226e616d65223a2261222c22726561736f6e223a22627265616b706f696e74222c227369676e616c223a352c22746964223a3735383330347d2c7b226e616d65223a2261222c22746964223a3735383638387d5d;thread-pcs:0000555555558cf8,00007ffff78d4075;00:fe00000000000000;01:0000000000000000;02:0000000000000000;03:0000000000000000;04:2fd0ffffff7f0000;05:0000000000000000;06:90d0ffffff7f0000;07:40d0ffffff7f0000;08:0000000000000000;09:ffffffff00000000;0a:0000000000000000;0b:4602000000000000;0c:68d2ffffff7f0000;0d:a08e555555550000;0e:b0ca565555550000;0f:00d0fff7ff7f0000;10:f88c555555550000;11:0602000000000000;12:3300000000000000;13:0000000000000000;14:0000000000000000;15:2b00000000000000;16:c001ecf7ff7f0000;17:0000000000000000;18:0000000000000000;19:0000000000000000;reason:breakpoint;#00
1710292556.594939232 (plugin = gdb-remote, state = stopped)
1710292556.595000744 (plugin = gdb-remote, state = stopped, stop_id = 9
1710292556.595845699
1710292556.595905304 ThreadList::ShouldStop: 2 threads, 2 unsuspended threads
1710292556.595955133 Thread::ShouldStop(0x56190fcbf450) for tid = 0xb9220 0xb9220, pc = 0x0000555555558cf8
1710292556.595966101 ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
1710292556.596014738 Plan stack initial state:
  thread #1: tid = 0xb9220:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Stepping over line main.cpp:436:6 using ranges: [0x0000555555558ce8-0x0000555555558cfd).

1710292556.596048355 ThreadPlanStepRange::NextRangeBreakpointExplainsStop - Hit next range breakpoint which has 1 constituents - explains stop: 1.
1710292556.596064806 Removing next branch breakpoint: -2.
1710292556.596127987 <  21> send packet: $z0,555555558cf8,1#14
1710292556.596235514 <   6> read packet: $OK#9a
1710292556.596294403 ThreadPlanStepOverRange reached 0x0000555555558cf8.
1710292556.596309662 DebuggerEventTimeoutHandler::HandleEvent(): got event: stopped.
1710292556.596381903 th1/fr0 supplying caller's saved rip (16)'s location, cached
1710292556.596421480 <  21> send packet: $x7fffffffce00,200#5f
1710292556.597071648 < 516> read packet: $0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000a66788f7ff7f00000000000000000000000000000000000078d35655555500000000000000000000000000000000000050d356555555000000000000000000009a8e88f7ff7f00000000000000000000000000000000000000000000000000000000000000000000000000000000000070cfffffff7f0000208d88f7ff7f000090cfffffff7f000000000000000000000000000000000000000000000000000050d356555555000000d3565555550000000000000000000000000000000000000006033547d716abe0cfffffff7f0000000000000000000030d0ffffff7f000068d2ffffff7f0000a08e555555550000b0ca56555555000000d0fff7ff7f00002fd0ffffff7f0000#b1
1710292556.597126961  th1/fr1 pc = 0x7ffff7888e9a
1710292556.597144604 th1/fr0 supplying caller's saved rbp (6)'s location, cached
...
1710292556.600282431   Frame 2 invalid RegisterContext for this frame, stopping stack walk
1710292556.600293398 th1 Unwind of this thread is complete.
1710292556.600337505 eFrameCompareYounger: tid(758304), index(1), with 2 frames,
frame #0: 0x00007ffff788679a /lib64/libc.so.6`__futex_abstimed_wait_common + 202
frame #1: 0x00007ffff7888e9a /lib64/libc.so.6`pthread_cond_wait@@GLIBC_2.3.2 + 234

1710292556.600444555 Couldn't find step through plan from address 0x555555558cf8.
1710292556.600466728 Stepping out of frame with no debug info
1710292556.600478172 ShouldStopHere callback returned 0 from 0x555555558cf8.
1710292556.600511312 <  34> send packet: $qMemoryRegionInfo:7ffff7888e9a#c1
1710292556.600546837 < 103> read packet: $start:7ffff7828000;size:175000;permissions:rx;flags:;name:2f7573722f6c696236342f6c6962632e736f2e36;#a8
1710292556.600627899 <  21> send packet: $Z0,7ffff7888e9a,1#c0
1710292556.600689888 <   6> read packet: $OK#9a
1710292556.600765467 Thread::PushPlan(0x0x56190fcbf450): "Stepping out from a`simulate_thread() + 72 at main.cpp:436:6 returning to frame at libc.so.6`pthread_cond_wait@@GLIBC_2.3.2 + 234", tid = 0xb9220.
1710292556.600785494 Plan Step range stepping over should stop: 0.
1710292556.600851297 Plan stack final state:
  thread #1: tid = 0xb9220:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Stepping over line main.cpp:436:6 using ranges: [0x0000555555558ce8-0x0000555555558cfd).
      Element 2: Stepping out from a`simulate_thread() + 72 at main.cpp:436:6 returning to frame at libc.so.6`pthread_cond_wait@@GLIBC_2.3.2 + 234 using breakpoint site -3

1710292556.600867271 vvvvvvvv Thread::ShouldStop End (returning 0) vvvvvvvv
1710292556.600891829 Thread::ShouldStop for tid = 0xb93a0 0xb93a0, pc = 0x00007ffff78d4075, should_stop = 0 (ignore since no stop reason)
1710292556.600904703 ThreadList::ShouldStop overall should_stop = 0
1710292556.600918293 ThreadList::ShouldReportStop 2 threads
1710292556.600933790 ThreadPlanStepRange::ShouldReportStop() returning vote -1

This is just a quick - initial impressions - reply…

The run thread plan behavior for expressions is a little more than just a general “try the plan on one thread, then all” plan. For instance, in one mode (--ignore-breakpoints 1 --unwind-on-error 1) this is a purely synchronous operation; we really pretend in this case that the target didn’t run at all, it just magically got an expression result or an error. In particular, we silently continue when breakpoint stops are hit.

However, in other cases it is “synchronous, but it will stop for extraordinary events”. If you run an expression with --ignore-breakpoints 0, then if the expression completes w/o hitting a breakpoint, it produces no events, but if it hits a breakpoint, it returns control to the user while the expression plan remains on the thread plan stack.

This semi-sync behavior has to be modeled somewhere, but I don’t think it’s a feature of the ThreadPlanSingleThreadTimeout. It’s more specific to expressions. How were you handling that part of the expression business?

As for inserting pairs of thread plans, whenever a thread plan has to push a subsidiary thread plan as it gets started, the right way to do that is in DidPush of the first plan. Between pushing the first plan and calling DidPush, no-one else gets to mess with the plan stack, so that will keep your pair of plans together. You might then end up with a ThreadPlanStepOverBreakpoint pushed on the bottom of the stack when the process starts again, so it can do the immediate task of stepping over the breakpoint & re-enabling it. Did you try that?

I think doing this as a thread plan is more composable, so that seems a better direction.

The one thing that I want to preserve however is that ThreadPlans don’t ever synchronously wait for anything from the target, they are always just responding to stop/running events from the process fed to them from the general stop handling machinery.

So for instance, they certainly shouldn’t call Halt, since it involves waiting for events, and thread plans don’t do that, they set things going, and then wait for the general machinery to call them back next time the process Stops. I think if you start trying to put modal wait loops inside of thread plan callbacks the system will get hard to reason about.

It would make more sense to have the state machine that’s governing the one-to-many thread transition call SendAsyncInterrupt and go back to waiting for the next stop event, which should be the interrupt.

However, that won’t work off the bat, because while SendAsyncInterrupt will produce a stop event, you are not guaranteed that your thread will be the one that catches the signal. On most Unix systems, it’s usually the main thread that gets fed signals. So the thread that requested the interrupt may not have a stop reason for the interruption stop.

We could fix that by adding a SendAsyncInterrupt(Thread *thread) API that arranges to send the interrupt, then attributes the stop to the thread that was passed to the API - kind of like pthread_kill but from the outside.
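
In sketch form (all names here are hypothetical, not existing API), the attribution could be as simple as remembering the requesting thread and fixing up the StopInfo when the interrupt stop arrives:

void Process::SendAsyncInterrupt(Thread *thread) {
  if (thread)
    m_interrupt_tid = thread->GetID(); // hypothetical member: who asked
  SendAsyncInterrupt();                // existing whole-process interrupt
}

// Later, in the stop handling: if m_interrupt_tid is set, remove the
// interrupt StopInfo from whichever thread happened to catch the signal
// and attach it to the thread with that TID instead.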

I think that would be a lot cleaner. Your thread plan, when it woke up for the timeout, would just call this new SendAsyncInterrupt API and go back to waiting. Then when the interrupt stop reason comes in on the thread governed by the plan, its state machine explains the stop, recognizes it as the one-to-many transition, changes the thread run states, and continues.

About stale frames, first off I’m pretty sure the approach I outlined above will not have that problem. The thing to remember is that until you pull the stop event off the public event queue, the process hasn’t formally stopped, and if you ask lldb its state you will get something stale. That’s on purpose because we need to keep the “process state” and the event stream in sync. Most likely you didn’t consume the event correctly or you didn’t update the run lock when you did that. That’s the danger of trying to do things modally within thread plans, and why you really want to keep the thread plan model as straightforward as possible: “get a stop, set up reactions by pushing more thread plans or setting breakpoints, then wait for next stop”.


DidPush(). Did you try that?

Ah, that’s exactly what I want. Thanks for pointing out.

The one thing that I want to preserve however is that ThreadPlans don’t ever synchronously wait for anything from the target, they are always just responding to stop/running events from the process fed to them from the general stop handling machinery.

Agreed. My prototype lets ThreadPlanSingleThreadTimeout asynchronously receive the interrupt stop event. I was not aware that ::Halt() is synchronous; I will switch to using SendAsyncInterrupt() instead.

that won’t work off the bat because while SendAsyncInterrupt will produce a stop event, you are not guaranteed that your thread will be the one that catches the signal

Good point. I hadn’t thought of this issue. I will try to add the enhanced SendAsyncInterrupt(Thread *thread) API. For my prototype, though, SendAsyncInterrupt() has been lucky enough to always deliver the stop event to the correct thread.

Most likely you didn’t consume the event correctly or you didn’t update the run lock when you did that.

Just want to point out that it is not my code that checks the stack depth; it is the existing ThreadPlanStepOverRange code here (llvm-project/lldb/source/Target/ThreadPlanStepOverRange.cpp at 0b4688403672264ab451992a3461a0df113c3bd7 · llvm/llvm-project · GitHub).

Unfortunately, after incorporating all this feedback, the prototype still suffers from the two issues I mentioned above:

  1. The interrupt stop event is treated differently from a normal stop event here (llvm-project/lldb/source/Target/Process.cpp at 0b4688403672264ab451992a3461a0df113c3bd7 · llvm/llvm-project · GitHub), so it always stops even when the thread plan stack returns false from ::ShouldStop(). @clayborg’s thought is that we need to attach some metadata to this special interrupt stop event so that the caller won’t always stop. I am not sure what the best way is, though.

  2. I haven’t tried to work around #1 to check whether the stale-frames issue persists.

I have shared my prototype in a draft patch. Hopefully it can help to demonstrate the issue: Single thread stepping timeout prototype by jeffreytan81 · Pull Request #85260 · llvm/llvm-project · GitHub

Interrupts are tricky because

(a) In practice, lldb just knows it has interrupted the whole inferior, and doesn’t actually have a way to know which thread’s StopInfo will reflect the actual interrupt.
(b) The StopInfo on the thread is not enough to tell you that this was an lldb interrupt - it could just as easily be someone sending the same signal to your process for other reasons. So currently threads can’t really reason about interrupt stops. You need to know that you had just sent the interrupt to disambiguate that.

But you are going to have to do some modification in this area to support SendAsyncInterrupt to a thread anyway. It seems like a natural way to implement that feature is that when you get the stop after SendAsyncInterrupt did its job, you would remove the interrupt StopInfo from the thread that happened to catch the stop signal, and put an interrupt StopInfo on the thread that asked for the interrupt. Instead of overloading StopInfoSignal, it would be cleaner to add a StopInfoInterrupt, and then add that to the designated thread.

Then in the case of non-thread-specific interrupts, we can do the same thing, detect which thread’s StopInfo was from the interrupt, and change that to StopInfoInterrupt (this is the same as the case where you asked for an interrupt for thread A and thread A happened to catch the signal, so that should fall out naturally).

Now that we’re at the point where there’s a thread that knows about the interrupt, not the program as a whole, we can then let the thread plans handle the interrupt, and expect to get a usable reply back from ShouldStop.

The last bit of this is that for the thread plans to handle general interrupts, there has to be some plan whose job that is. But that’s easy, this is equivalent to the handling of User Breakpoints. Those are handled by ThreadPlanBase, which is also a suitable place to handle “This thread got an interrupt that no other plan explained”.
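
For concreteness, a sketch of what such a StopInfoInterrupt might look like (this assumes a new eStopReasonInterrupt enum value; none of these names are final):

class StopInfoInterrupt : public StopInfo {
public:
  StopInfoInterrupt(Thread &thread, int signo) : StopInfo(thread, signo) {}

  lldb::StopReason GetStopReason() const override {
    return lldb::eStopReasonInterrupt; // distinct from eStopReasonSignal
  }

  const char *GetDescription() override {
    return "async interrupt requested by the debugger";
  }
};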

Thanks Jim.

I seem to be able to get the prototype working with your suggestion (the stack frame issue went away :slight_smile:).

There are still several design considerations I would like to hear your thoughts on before I revise the patch for review:

  1. Even after ThreadPlanSingleThreadTimeout finishes its state machine, I chose not to pop it, because it is very likely in the middle of the plan stack, and popping it would accidentally remove other uncompleted child plans (like ThreadPlanStepOut). Does this make sense?

  2. How to ensure ThreadPlanSingleThreadTimeout gets the async interrupt callback.

After ThreadPlanSingleThreadTimeout sends the thread-specific async interrupt, Thread::ShouldStop() will ask the current plan whether it explains the stop and then call its ::ShouldStop() method. However, ThreadPlanSingleThreadTimeout won’t be the current plan, so it is up to the current thread plan to cooperate and yield control to ThreadPlanSingleThreadTimeout. More specifically, I had to modify ThreadPlanStepOut::DoPlanExplainsStop to return false for the new eStopReasonInterrupt so that ThreadPlanSingleThreadTimeout gets a chance to handle the async interrupt:

1711336859.240231037 Plan stack initial state:
  thread #1: tid = 0xa97022:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Stepping over line main.cpp:436:6 using ranges: [0x00000001000034e8-0x00000001000034f8).
      Element 2: ThreadPlanSingleThreadTimeout, state(1)
      Element 3: Stepping out from a`void std::__1::condition_variable::wait<simulate_thread()::$_0>(std::__1::unique_lock<std::__1::mutex>&, simulate_thread()::$_0) at __mutex_base:396 returning to frame at a`simulate_thread() + 76 at main.cpp:436:6 using breakpoint site -5

Does this collaboration model make sense to you?

  3. How to ensure all threads resume after timeout.

My initial attempt calls Thread::GetCurrentPlan()->SetStopOthers(). But that may mean calling ThreadPlanStepOut::SetStopOthers(), which is not implemented and does nothing. This can easily be fixed by overriding ThreadPlanStepOut::SetStopOthers(). But I wonder if it would be safer to have a Thread::SetStopOthers() API that explicitly calls SetStopOthers() for each ThreadPlan in the stack, so that after the timeout, resuming all threads is reflected not only in the leaf thread plan but in every plan.
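
Something like this sketch is what I have in mind (ignoring access control and smart pointers for brevity):

// Hypothetical Thread-level helper: apply the decision to every plan on
// the stack, not just the leaf plan.
void Thread::SetStopOthers(bool stop_others) {
  for (ThreadPlan *plan = GetCurrentPlan(); plan != nullptr;
       plan = plan->GetPreviousPlan())
    plan->SetStopOthers(stop_others);
}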

Let me know your thoughts. Thanks.
Jeffrey

For points 1 & 2:

I don’t think it’s a good idea to keep the ThreadPlanSingleThreadTimeout on the stack after any stop.

If a thread plan’s run involves a series of child plans, when each of those child plans stops, the controlling thread plan may well want to recompute how long the one thread timeout should be for the next stage. For instance, if what you are primarily trying to avoid is deadlocks, you might use the same one-thread timeout every time you resumed after an internal stop, because this thread is obviously still making progress. More generally, the controlling plan is the one that knows the internal steps it will need to do, and is in a better position to reason about each resume it is doing.

So I think it’s going to be much easier to handle if the ThreadPlanSingleThreadTimeout only lives between two stops of that thread. It gets pushed on one stop, and the next time that thread stops for a reason it always gets popped. When it gets popped it can report the “time remaining” back to the Thread plan that pushed it (we have to cook up some way to do that). Then when the controlling Thread Plan wants to continue it would push another ThreadPlanSingleThreadTimeout, or it would enqueue a child thread plan passing it the appropriate one thread timeout.

That would also solve your async signal reception problem, because the ThreadPlanSingleThreadTimeout is always going to be the youngest member of the stack. The one exception to this is that if you are resuming from a breakpoint we’ll insert a ThreadPlanStepOverBreakpoint as a last step before resuming. Note, if the step over breakpoint exceeds the one thread timeout, that shouldn’t cause a problem. It doesn’t explain signal stops so it wouldn’t get in the ThreadPlanSingleThreadTimeout plan’s way of handling the stop. And it’s always fine to pop the ThreadPlanStepOverBreakpoint since we’ll make up a new one when we go to resume the thread and notice it’s sitting at a breakpoint.

For point 3:

Before your changes, there were lots of plans that it was unsafe to run with all other threads stopped, for instance ThreadPlanStepOut since that ran arbitrary code. So it didn’t make sense to try to handle GetStopOthers in generic code, there wasn’t a right answer for all plans that the ThreadPlan base class could determine.

With your changes, it should be safe to run most plans with a one-thread timeout. The exception will really only be plans that want to enforce running only one thread, like the ThreadPlanStepOverBreakpoint. I think we need to make all the plans aware that they can run with a timeout, otherwise the timeout feature will break if plan A needs to push a subplan that doesn’t know how to handle timeouts.

So we should be able to push the timeout handling into ThreadPlan. We probably want this to be a lower-level API in ThreadPlan that returns an optional, where true & false mean “there’s a one thread timeout in force, so you should trust me” and no value means “there’s no one thread timeout, do what is safe.” Then each class’s GetStopOthers will consult this API, and if it has an opinion, do what it says, and if it doesn’t, do what is safe against deadlocks. Or of course, like ThreadPlanStepOverBreakpoint, ignore the base API, since regardless of the timeout it wants to only run one thread so we don’t miss other threads hitting this breakpoint.
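
A minimal sketch of that shape (method and member names are my assumptions, not existing API):

// In the ThreadPlan base class: answer only when a one-thread timeout is
// in force for this thread; otherwise express no opinion.
std::optional<bool> ThreadPlan::StopOthersWithTimeout() {
  if (GetThread().InSingleThreadTimeoutMode()) // hypothetical query
    return true; // the timeout machinery will rescue us from any stall
  return std::nullopt;
}

// A plan that would otherwise insist on running all threads:
bool ThreadPlanStepOut::StopOthers() {
  if (std::optional<bool> forced = StopOthersWithTimeout())
    return *forced; // a timeout is in force, so trust it
  return false;     // no timeout: run everyone, the deadlock-safe default
}

// ThreadPlanStepOverBreakpoint would ignore the base API and keep returning
// true, since it must run a single thread regardless of any timeout.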

@jingham, I was busy with other work and did not have time to work on this until now.

I like the ThreadPlanSingleThreadTimeout remove/re-add across internal stops idea and got it working for a basic testcase.

The implementation relies on ThreadPlanSingleThreadTimeout::MischiefManaged() always returning true, to ensure ThreadPlanSingleThreadTimeout gets popped/removed at each stop. And it relies on the parent step-over plan, ThreadPlanStepOverRange::DoWillResume(), re-pushing a new ThreadPlanSingleThreadTimeout to ensure it is the current/leaf plan.
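
In sketch form, the two pieces look like this (simplified from the prototype; PushNewTimeoutPlan is a hypothetical helper wrapping QueueThreadPlan):

// Always pop on any stop of this thread, per the "lives between two stops"
// design.
bool ThreadPlanSingleThreadTimeout::MischiefManaged() { return true; }

// Before every resume, put a fresh timeout plan at the leaf of the stack.
void ThreadPlanStepOverRange::DoWillResume(lldb::StateType resume_state,
                                           bool current_plan) {
  if (StopOthers()) // still in single-thread stepping mode
    PushNewTimeoutPlan();
}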

However, when I tried it against a real-world target, I observed several issues:

  1. In the middle of a run, certain modules got loaded, triggering a hit on the POSIX dynamic loader’s Rendezvous breakpoint as the stop event. For this special private stop, the stop info’s ShouldStopSynchronous() returns false, and Thread::ShouldStop() returns early without giving ThreadPlanSingleThreadTimeout a chance to pop/remove itself. Later, ThreadPlanStepOverRange::DoWillResume() pushes a second ThreadPlanSingleThreadTimeout, which is not what we want. I could try to detect this and pop plans from ThreadPlanSingleThreadTimeout::WillStop(), but I am not sure that’s the correct way.

  2. There can be a race condition: after ThreadPlanSingleThreadTimeout sends an async interrupt and before it receives it, some other thread plan/stop pops/removes ThreadPlanSingleThreadTimeout, leaving no thread plan to handle the async interrupt. We would have to remember the previous state (async interrupt sent) in the parent thread plan, which seems ugly.

With the above observations, I am less confident that we can easily ensure ThreadPlanSingleThreadTimeout remains a singleton and always stays the youngest plan during various complex interactions. What do you think?

Jeffrey

Just a small drive-by.

On Linux, only threads that are running can receive signals. So, if you only have one thread running, the interrupt signal will always be delivered to that thread (unless it stops on its own before that happens).

I don’t know how this works on other operating systems. However, I think an interesting question here is what the communication at the gdb-remote level should look like. The communication for a step-thread+interrupt sequence would look something like:

-> vCont;c:<thread1>
# later
-> ^C
<- $T??thread:<thread2>

The question is whether <thread1> can ever be different from <thread2>. On Linux, this will never be the case. However, I would say that this should be true for other OSes as well. I’m clearly biased here, but as a gdb-remote client, I would find it very confusing if I asked the server to run one thread, and it came back saying that another thread has stopped (regardless of whether that stop was caused by my ^C or some external signal).

If we can agree on that, then I think it would be up to the server to map the stop signal to a running thread (if it’s not running on an OS that will do that automatically). And the client should be able to mostly (*) rely on that.

(*) Only mostly, because the situation can be still ambiguous if there is more than one thread running. So if you want/need to make the code generic enough to handle multiple running threads, you may need to have some special client-side code as well.

I don’t think you can change the plan stack while the private state is “running”. You couldn’t from the outside; any command like “step” requires the target to be stopped to even allow the command. I’m not sure whether we also enforce this at the lower layer (e.g. have PushPlan check the private state and return an error if running), but if we need to we can do that.

In any case, I think it’s a bad idea - way too racy - to allow the thread plan stack to change while the thread is running, and I can’t think of a good reason to allow that. So I don’t think you need to factor that case into your design.

If this causes problems the correct (IMO) solution is to disallow this more fundamentally…

@jingham, thanks.

I think I might have solved this issue by ensuring only one ThreadPlanSingleThreadTimeout is alive.

Let me step back a bit and ask several fundamental questions which I am not sure I’ve got right:

  1. How to ensure ThreadPlanSingleThreadTimeout is popped every time an internal stop happens?

This is important to maintain your suggested invariant that ThreadPlanSingleThreadTimeout always stays the leaf thread plan.
Originally, I thought ThreadPlanSingleThreadTimeout::MischiefManaged() returning true would achieve that, but on double-checking I find that is not true.
Another way is returning true from ThreadPlanSingleThreadTimeout::DoPlanExplainsStop(), which I do not think is right.

  2. When the timeout async interrupt internal stop arrives, ThreadPlanSingleThreadTimeout completes. Then I noticed that its parent ThreadPlanStepOverRange::ShouldStop() is called. Since, when the async interrupt happens, there can be many frames younger than the original step-range frame, this can cause several step-out thread plans to be pushed to get out. I think it would be faster if we could simply resume from the async interrupt. Is there any way to prevent ThreadPlanStepOverRange::ShouldStop() from being called so that we can tell the process to simply resume instead?

  3. Per my understanding, ThreadPlanSingleThreadTimeout::GetPlanRunState() needs to forward to GetPreviousPlan()->GetPlanRunState(), because it does not itself know whether to perform instruction stepping or simply run. However, GetPlanRunState() is a protected method. What’s your suggestion?

Let me say what I thought this was going to do, and see if we can converge on something…

One simplification I want to make here is that these “one thread timeouts” are really about preventing stalls. So I think it’s fine that every time we restart we use the same timeout. After all, we want to keep on this thread “as long as it is making progress”. We aren’t saying “I guarantee I will run in one thread state for X seconds”. We’re saying “I take a stall of X seconds to mean this one thread isn’t going to make progress on its own”. So we don’t need to keep track of a decrementing timer or anything global like that. We only need to know that we’re still using the timer.

Suppose the user did a “next”. Then if we were going to use the single thread timeout, we would push a ThreadPlanSingleThreadTimeout as the leaf node, and continue with only that one thread running. Then in lldb, we’d set ourselves a timer for the one-thread interval - if it fires, we will send an “interrupt” to the process.
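
Concretely, the timer could be a small helper thread along these lines (a sketch; the member names are illustrative):

// Wait for the single-thread interval; if nothing cancels the timer first,
// request an interrupt stop from the process.
void ThreadPlanSingleThreadTimeout::TimeoutThreadFunc() {
  std::unique_lock<std::mutex> lock(m_mutex);
  bool cancelled =
      m_wakeup_cv.wait_for(lock, m_timeout, [this] { return m_exit_flag; });
  if (!cancelled)
    m_process.SendAsyncInterrupt(); // timer fired: interrupt the process
}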

There are two cases now:

1: The process stops before our timer expires:

So ThreadPlanSingleThreadTimeout should always say it explains the stop. Saying you explain the stop doesn’t mean you are the only one that will get to handle the stop. It just means “I have some work to do on this stop”.
ThreadPlanSingleThreadTimeout always has work to do: noting whether the timer has expired or not. If you do this, then MischiefManaged → true should cause the plan to get popped. Then the stop will get propagated up the thread plan stack to the next plan that also explained the stop.

Now there are three sub-cases:

(a) There’s a plan that explained the stop which is a controlling plan and is done with its job:

We wipe our “I was timeout-stepping this thread” info: we aren’t doing that anymore because we’re returning control to the user; and pop the plan as per usual.

(b) There’s an explaining plan but it has more work to do. After all, it may take a series of steps to make our way through a complicated source line. Suppose, for instance, we stopped because we instruction single-stepped over a branch, but the instruction we stepped to is still in the line range we’re next-ing over. Then we have to consult the “next” plan to see what to do. It will get asked what to do next in the ordinary course of things, and will set the appropriate breakpoint for the next stage of this step.

Then when we go to continue, the Thread which is getting to run solo will return true for StopOthers and push a ThreadPlanSingleThreadTimeout and continue the process again with a timeout.

(c) The stop reason was not handled by the current controlling plan (e.g. lldb did a next over a function that the user had a breakpoint in). In that case we’re going to stop to return control to the user, so we zero out our timer and stop. I think that’s the most natural behavior. This whole feature is about how thread stepping happens when the user isn’t in control. So it also makes sense to return control over the timeout behavior to them as well. They can choose how to do the next continue of the process.

2: The one thread timer has expired.

We will mark that thread as now in “run all threads” mode. But again, we have to consult the parent plans, because we might accidentally have hit our interrupt timeout when the thread plan was also getting to one of its end stages. For instance, it’s not impossible that lldb sends the interrupt, but before that gets to the debug stub the process already had stopped at one of the controlling plan’s breakpoints.

In this case, we have:

(a) The stop reason was just the interrupt we sent, so we know we should just continue with all threads running. The other plans won’t explain the Interrupt stop reason, so that will propagate to the base thread plan, which can recognize this timer interrupt stop and continue.

(b) We sent the interrupt but got another stop reason back and there’s a controlling plan that explains this and has more work to do, it will continue the process as normal.

(c) Otherwise the unexplained stop reason will make its way to the Base thread plan, which will stop appropriately.

The rule here is that any time we return control to the user, we unset our decision about whether we were timeout-single stepping this thread.

There is one bit you’ll have to watch out for: Normally I try not to starve threads, so if more than one thread says it wants to StopOthers I randomly choose the one that gets to run. But if we are in the mode where Thread A won the right to run solo with a timeout, it needs to keep winning that random choice till it accomplishes its task.

Does this make any sense?

@jingham, yes, that generally aligns with my high-level understanding, but there are various thread plan stack states that break parts of it.

As an example, I took the suggestion to return true from ThreadPlanSingleThreadTimeout::DoPlanExplainsStop(), which enables me to pop between stops (great!). Then I noticed that since ThreadPlanSingleThreadTimeout::DoPlanExplainsStop() intercepts the internal stop, ThreadPlanStepRange::NextRangeBreakpointExplainsStop() does not get a chance to run, which is supposed to clear the previously set branch breakpoint. Since the lingering branch breakpoint still exists, when the thread resumes again, a new ThreadPlanStepOverBreakpoint is pushed to step over it:

1714492134.896792650 intern-state     Thread::ShouldStop(0x55e8f60cbed0) for tid = 0x3f3a64 0x3f3a64, pc = 0x00005555555594ab
1714492134.896821260 intern-state     ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
1714492134.896854639 intern-state     Plan stack initial state:
  thread #1: tid = 0x3f3a64:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Stepping over line main.cpp:579:12 using ranges: [0x00005555555591a2-0x00005555555591b5).
      Element 2: Single stepping past breakpoint site 130 at 0x5555555591af  **<== step over branch breakpoint
      Element 3: Single thread timeout, state(WaitTimeout)

Then the thread reaches outside the stepping range, and all thread plans are popped and completed. Even though ThreadPlanStepOverRange says we should stop, ThreadPlanStepOverBreakpoint says we should auto-continue, which overrides the stop decision and turns the step into a continue:

1714492134.896997452 intern-state     ThreadPlanSingleThreadTimeout::DoPlanExplainsStop(): got event: stopped.
1714492134.897028446 intern-state     ThreadPlanSingleThreadTimeout::HandleEvent(): got event: stopped.
1714492134.897052765 intern-state     Plan Single thread timeout should stop: 0.
1714492134.897074223 intern-state     ThreadPlanSingleThreadTimeout::MischiefManaged() called.
1714492134.897101641 intern-state     ThreadPlanSingleThreadTimeout::DidPop().
1714492134.897156954 intern-state     ThreadPlanSingleThreadTimeout::TimeoutThreadFunc() called with m_exit_flag(1).
1714492134.897271872 intern-state     Popping plan: "Single thread timeout", tid = 0x3f3a64.
1714492134.897300482 intern-state     Plan Step over breakpoint trap should stop: 0.
1714492134.897323370 intern-state     Completed step over breakpoint plan.
==> **1714492134.897378206 intern-state     Plan Step over breakpoint trap auto-continue: true.**
1714492134.897401333 intern-state     Popping plan: "Step over breakpoint trap", tid = 0x3f3a64.
1714492134.897425652 intern-state     ThreadPlanStepOverRange reached 0x00005555555594ab.
1714492134.897450209 intern-state     ThreadPlanStepOverRange compare frame result: 2.
1714492134.897492409 intern-state     Step range plan out of range to 0x5555555594ab
1714492134.897521734 intern-state     Removing next branch breakpoint: -37.
1714492134.897608757 intern-state     ShouldStopHere callback returned 1 from 0x5555555594ab.
1714492134.897633553 intern-state     Plan Step range stepping over should stop: 1.
1714492134.897656679 intern-state     Completed step through range plan.
1714492134.897679806 intern-state     Popping plan: "Step range stepping over", tid = 0x3f3a64.
1714492134.897711515 intern-state     Plan stack final state:
  thread #1: tid = 0x3f3a64:
    Active plan stack:
      Element 0: Base thread plan.
    Completed plan stack:
      Element 0: Single thread timeout, state(WaitTimeout)
      Element 1: Single stepping past breakpoint site 130 at 0x5555555591af
      Element 2: Stepping over line main.cpp:579:12 using ranges: [0x00005555555591a2-0x00005555555591b5).
1714492134.897744417 intern-state     vvvvvvvv Thread::ShouldStop End (returning 0) vvvvvvvv
1714492134.897766829 intern-state     ThreadList::ShouldStop overall should_stop = 0

Maybe I missed something, but my biggest challenge is that, under the current design, ThreadPlanSingleThreadTimeout needs to know how it interacts with various thread plan scenarios, which makes it hard to keep self-contained and reason about.

I don’t think this is intrinsic to what you are doing. Rather, you are running into an asymmetry in how the thread plan negotiation works that has not caused problems to date, but is awkward.

When a stop reason is propagating up the plan stack, each plan gets asked to explain the stop until one of them does. Then it gets to say whether it is done or not. Then things get odd. If the plan that explained the stop was a “controlling” plan - i.e. this is the plan that expressed some user intention - we stop. If not we don’t go back to the top of our loop, instead we break, and go on to a pass through the rest of the plans that doesn’t do explain stop. This is what the done_processing_current_plan in Thread::ShouldStop is for.

There are comments about this in the ThreadPlan header, basically you can’t assume that ExplainsStop will get called before ShouldStop. So plans have to be prepared to do their work either in ExplainsStop or ShouldStop.

It sounds like there’s one of the range plans that isn’t doing this right.

There must have been some reason for this asymmetry, but I can’t remember what it is right now. It might be interesting to see if we could eliminate that second processing step and, instead of break-ing when the explaining plan isn’t a controlling plan, go back to the top of the loop and keep handling plans in the same way.

That might break horribly for the reason I’m currently forgetting. But if we can make that work it would be a more principled way to make the plan stack less affected by the introduction of new worker plans like your timeout one.

@jingham, thanks a lot for all the suggestions/feedback, really appreciated!

New ThreadPlanSingleThreadTimeout to resolve potential deadlock in single thread stepping by jeffreytan81 · Pull Request #90930 · llvm/llvm-project · GitHub is the PR that can work for all our internal scenarios per my testing.

Any feedback to help polish it is welcome.