Improve single thread stepping

We have several internal Linux targets that have a non-trivial number of threads (3000+). I acknowledge that this is not ideal, and we’ve provided feedback to the target owners to reduce the thread count.

The high number of threads has resulted in significant slowness when using LLDB for stepping. In some cases, it takes around 10 seconds to step through certain non-trivial statements. We’ve observed that switching to single-thread stepping significantly improves performance, reducing the time to around 3 seconds. However, we are aware that single-thread stepping can be risky and may lead to deadlocks.

To address this issue, I’d like to explore a few options:

  1. @clayborg mentioned that there used to be a mode or logic to specify a timeout for single-thread stepping: if the stepped thread exceeded the timeout, the thread plan would resume the other threads, potentially avoiding deadlocks. Has this logic been reverted?
  2. Is it feasible to introduce some form of recovery logic? For instance, when a deadlock occurs, users typically pause asynchronously in LLDB and then resume. Currently, this doesn’t resolve deadlocks. However, we could modify LLDB to resume all threads to address the deadlock situation.
  3. This could potentially be an RFC. The existing while-stepping ThreadPlan prevents other threads from executing unless there are call sites within the step range, which is quite conservative. On Linux, could we improve the stepping mode by using ptrace() to detect when the stepped thread is about to enter a syscall, and only then resume the other threads (see the sketch right after this list)? This assumes that a thread can only block on a lock via a syscall, which may not hold true for user-mode spin locks; we’re unsure how complex special detection for those would be. If this approach is viable, it could allow us to remain in single-thread stepping mode across most call sites.
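
For reference, here is a minimal sketch (plain ptrace code, not LLDB code) of the OS-level mechanism option 3 assumes: PTRACE_SYSCALL resumes a traced thread but stops it again at the next syscall boundary, so the tracer gets a synchronous notification before the thread can block in the kernel. It assumes the tracer has already attached to the thread and set PTRACE_O_TRACESYSGOOD.

```cpp
#include <csignal>
#include <cstdio>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

// Resume a stopped, already-traced thread and wait until it reaches the next
// syscall boundary.  Error handling and attach/option setup are elided.
void run_until_syscall_entry(pid_t tid) {
  // Ask the kernel to stop the thread at the next syscall entry/exit instead
  // of letting it run freely.
  if (ptrace(PTRACE_SYSCALL, tid, nullptr, nullptr) == -1) {
    perror("ptrace(PTRACE_SYSCALL)");
    return;
  }
  int status = 0;
  if (waitpid(tid, &status, __WALL) == -1) {
    perror("waitpid");
    return;
  }
  // With PTRACE_O_TRACESYSGOOD set, a syscall stop is reported as
  // SIGTRAP | 0x80, which distinguishes it from breakpoints and single-step
  // traps.  This is the point where a debugger could decide to resume the
  // other threads.
  if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
    printf("thread %d is about to enter a syscall\n", (int)tid);
}
```

Whether a syscall stop is a good enough proxy for “might block” is exactly the user-mode spin lock question raised above.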

cc @jingham, @labath

I have a few responses.

First:

What part(s) of stepping in this situation are taking the majority of that time? It’s hard to reason about solutions without knowing what the actual problem is.

I’m confused in particular about why only allowing one thread to run speeds things up. The only difference from lldb’s perspective is that in the run-one-thread case it tells the system to suspend all the other threads, rather than telling it not to. If anything, I would have guessed that having to suspend all the other threads would be more time-consuming, since you generally don’t have to do anything to let a thread continue (which would make all-thread stepping faster).

So to give a real answer, I’d need some finer-grained data.

Second:

To answer your individual points:

  1. Greg is almost right. There’s a routine in Process called “RunThreadPlan” which is meant to handle running a thread plan a bit on one thread, then falling back to running all threads. When I designed that originally it was meant to be general, but the first use it was put to was running the expression evaluation thread plans, and since we didn’t have a use for it other than that, over time it’s gotten too specific to that purpose. But you certainly could re-extend that machinery to handle general thread plans (the first sketch after this list shows how that one-thread timeout surfaces today for expressions).

  2. IIUC, getting RunThreadPlan to handle the stepping plans as well as the expression running plans would give you your “Recovery mechanism”. Note, some thread plan uses - e.g. when using the step-instruction plan to step over breakpoints - have to run only one thread, so you’d have to put in some force-one-thread mode to control for that.

  3. If ptrace will stop you synchronously when a syscall occurs, then treating that event as a “run all threads” signal should be fairly straightforward. The current “Explained Stop” in the stepping thread plans for this purpose is “stopping at a branch instruction”; adding “got the ptrace syscall notification” as another “Explained Stop” in the stepping plans should not be hard. The trickiest part is probably structuring this as a pluggable system so you can support other systems with different such decision points (a purely hypothetical sketch of that also follows this list).
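
For concreteness, here is roughly how that “run on one thread for a while, then fall back to all threads” behavior surfaces today through the public SB API for expression evaluation. This is only an illustration of the existing knobs mentioned in point 1; the timeout values are arbitrary, and extending the same machinery to stepping plans is the part that does not exist yet.

```cpp
#include "lldb/API/SBExpressionOptions.h"
#include "lldb/API/SBFrame.h"
#include "lldb/API/SBValue.h"

lldb::SBValue EvaluateWithOneThreadTimeout(lldb::SBFrame &frame) {
  lldb::SBExpressionOptions options;
  // The expression runs on the current thread first; if the one-thread
  // timeout expires and TryAllThreads is set, RunThreadPlan resumes with all
  // threads for the remainder of the overall timeout.
  options.SetTryAllThreads(true);
  options.SetOneThreadTimeoutInMicroSeconds(500 * 1000); // 0.5 s on one thread
  options.SetTimeoutInMicroSeconds(5 * 1000 * 1000);     // 5 s overall
  // "some_function()" is just a placeholder expression.
  return frame.EvaluateExpression("some_function()", options);
}
```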
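
And a very rough, purely hypothetical sketch of the pluggable decision point from point 3; none of these names exist in LLDB today, this is only meant to show the shape of the idea.

```cpp
// Hypothetical interface: the stepping plan asks a per-platform policy whether
// a given stop is a point where the stepped thread might block, in which case
// the plan should fall back to running all threads.
struct ExplainedStop {
  enum class Kind { BranchInstruction, SyscallEntry, Other };
  Kind kind = Kind::Other;
};

class SingleThreadStepPolicy {
public:
  virtual ~SingleThreadStepPolicy() = default;
  virtual bool StopRequiresAllThreads(const ExplainedStop &stop) const = 0;
};

// A Linux flavor could key off the ptrace syscall-entry notification.
class LinuxSyscallStepPolicy : public SingleThreadStepPolicy {
public:
  bool StopRequiresAllThreads(const ExplainedStop &stop) const override {
    return stop.kind == ExplainedStop::Kind::SyscallEntry;
  }
};
```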

As for “I can detect all potential deadlocks in any random code that lldb might get shown” - I’m a little skeptical, but even so, a strategy that catches the common cases, coupled with a fairly long “run on a single thread” timeout from doing #1 might be close enough in effect, even if you can’t catch all the cases that could deadlock a thread.

So the strategies you mention, RunThreadPlan for stepping, with a reasonably long “run on one thread” timeout, and some help for known synchronization points, should not be particularly hard to implement, or at all out of place with the way execution control currently works.

Third:

Another idea Pavel and I discussed a couple years back for situations with lots of threads is that until lldb decides to return control to the user, it really doesn’t need to know about threads that don’t “stop for a reason” (e.g. breakpoint, signal, single-step completion). At any given stop, most threads were just running along when one thread stopped the process, and those threads don’t contribute to any of the “should stop” negotiation we do on stop.

lldb already internally supports a debug stub that only tells it about threads that have stopped “for a reason” at any given stop, so this wouldn’t require much work in lldb to make that happen. You would need to change the stub so it can tell when to return “threads stopped with a reason” and “all threads”. One simple way to do that is to have an extended stop reply packet that fills in only the threads stopped “for a reason” and then lldb can get all threads when it needs to with jThreadsInfo.
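
As a hedged sketch of the stub-side half of that (placeholder types here, not actual lldb-server classes): when building the stop reply, the stub would include only the threads that have a real stop reason, and the client would fetch the complete list later (e.g. via jThreadsInfo) when it actually needs it.

```cpp
#include <cstdint>
#include <vector>

// Placeholder for the stub's per-thread state.
struct StubThread {
  uint64_t tid;
  bool has_stop_reason; // breakpoint, signal, single-step completion, ...
};

// In the "forgetful" mode, the stop reply lists only the threads that stopped
// for a reason; the rest were merely suspended along with the process and can
// be reported later when the client asks for the full list.
std::vector<uint64_t>
ThreadsForStopReply(const std::vector<StubThread> &threads,
                    bool report_all_threads) {
  std::vector<uint64_t> result;
  for (const StubThread &t : threads)
    if (report_all_threads || t.has_stop_reason)
      result.push_back(t.tid);
  return result;
}
```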

Anyway, to make this work we’d just have to teach the stub (lldb-server) to return only the threads “stopped for a reason” in the stop packet, and put lldb in the forgetful stub mode. If you want to play with this mode, it’s currently set by the overly specific:

target.process.experimental.os-plugin-reports-all-threads

setting, but it’s actually a general feature in lldb. It’s named that because to date OS Plugins were the only ones that actually did this.

For the most part, users want to know about all the threads that exist in their program, so I don’t think it’s a great user experience if we remain ignorant of “uninteresting” threads on user stops. Plus, getting all the threads every so often really helps our garbage collection. But most user-level stepping actions involve a bunch of private stops, and for those we definitely don’t need any threads that haven’t stopped “for a reason”.

That strategy would seem promising, since it greatly reduces the data and processing on all these private stops. But OTOH, your saying that “stepping only a single thread is much faster” makes it sound like this isn’t the primary issue, since we’re told about all threads regardless of whether we’re going to let them run or suspend them.

So again, before going much further with this discussion I’d want to know what is slow here in more detail.


Thank you for the insights. This is incredibly helpful!

This is my first time delving into the ThreadPlan logic, and I’ve only done a brief read on single-thread stepping. Regarding the performance bottleneck, the biggest challenge is determining whether it’s latency-bound or CPU-bound. Based on my investigation of a step operation taking 8-10 seconds, there are 8 internal/private stops, 3 of which stop the other threads, while the other 5 resume all threads.

I’ve profiled LLDB during the stepping process using the Linux perf tool, which attributes roughly 75% of CPU time to a couple of hot paths. The most critical path is ProcessGDBRemote::GetThreadStopInfoFromJSON(), called from ProcessGDBRemote::CalculateThreadStopInfo(). We enumerate each thread and search m_jstopinfo_sp for its stop information by TID; when stopping all threads the m_jstopinfo_sp JSON can be quite large, so this per-thread linear search is potentially O(N^2) overall. There’s a similar issue in ThreadList::GetBackingThread(). I’ve made a quick prototype using a hash table to map <TID => stop_info>, and it improves stepping performance by around 10-20%, although that’s not as significant as single-thread stepping. I’ve also noticed various JSON parsing operations in ProcessGDBRemote::SetThreadStopInfo(). It seems that the size of jstopinfo is much larger when stopping all threads.
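
To make the prototype concrete, it is essentially an index built once per stop; this is just a rough sketch of the idea with placeholder types, not the actual LLDB code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for one decoded entry of the jstopinfo JSON array.
struct ThreadStopInfoJSON {
  uint64_t tid;
  std::string reason;
  // ... other per-thread fields ...
};

// Build the index once per stop: O(N) total, instead of each of the N threads
// doing its own linear scan of the array (O(N^2) total).
std::unordered_map<uint64_t, const ThreadStopInfoJSON *>
IndexStopInfoByTID(const std::vector<ThreadStopInfoJSON> &jstopinfo) {
  std::unordered_map<uint64_t, const ThreadStopInfoJSON *> index;
  index.reserve(jstopinfo.size());
  for (const ThreadStopInfoJSON &entry : jstopinfo)
    index.emplace(entry.tid, &entry);
  return index;
}

// Per-thread lookup then becomes O(1) on average.
const ThreadStopInfoJSON *
FindStopInfo(const std::unordered_map<uint64_t, const ThreadStopInfoJSON *> &index,
             uint64_t tid) {
  auto it = index.find(tid);
  return it == index.end() ? nullptr : it->second;
}
```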

Another possibility is that the slowness is latency-bound, as you’ve mentioned. Resuming/pausing 3000+ threads and synchronously waiting for them can indeed take some time.

I’ve also experimented with setting target.process.experimental.os-plugin-reports-all-threads to false, hoping it would reduce jstopinfo, but I haven’t seen any performance improvement from it. Currently, it’s not entirely clear to me what is causing the slowness, but if you ask me, it’s most likely the latency issue mentioned above.

The second and third responses are quite intriguing. I might explore some of them if this turns out to be a high-priority issue. I had the same question in my mind while examining the profile trace—whether we really need to update the stop info during private/internal stops. However, I haven’t delved deep enough to propose any solutions. It’s great that you and Pavel have already explored this to some extent. I wonder if, to reduce the size of jstopinfo, setting target.process.experimental.os-plugin-reports-all-threads to false is sufficient or if further optimizations are needed?

Snippet of the profile trace:

|--49.81%--lldb_private::ThreadList::ShouldStop(lldb_private::Event*)
                          |          |
                          |           --49.43%--lldb_private::Thread::GetStopInfo()
                          |                     |
                          |                      --49.35%--lldb_private::Thread::GetPrivateStopInfo(bool)
                          |                                |
                          |                                 --49.30%--lldb_private::process_gdb_remote::ThreadGDBRemote::CalculateStopInfo()
                          |                                           lldb_private::process_gdb_remote::ProcessGDBRemote::CalculateThreadStopInfo(lldb_private::process_gdb_remote::ThreadGDBRemote*)
                          |                                           |
                          |                                            --48.63%--lldb_private::process_gdb_remote::ProcessGDBRemote::GetThreadStopInfoFromJSON(lldb_private::process_gdb_remote::ThreadGDBRemote*, std::shared_ptr<lldb_private::StructuredData::Object> const&)
                          |                                                      |
                          |                                                      |--12.24%--lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(lldb_private::StructuredData::Dictionary*)
                          |                                                      |          |
                          |                                                      |           --12.15%--lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(unsigned long, std::map<unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, unsigned char, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, bool, lldb_private::LazyBool, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, lldb::QueueKind, unsigned long)
                          |                                                      |                     |
                          |                                                      |                     |--5.16%--lldb_private::ThreadList::GetBackingThread(std::shared_ptr<lldb_private::Thread> const&)
                          |                                                      |                     |
                          |                                                      |                     |--2.94%--lldb_private::ThreadList::FindThreadByProtocolID(unsigned long, bool)
                          |                                                      |                     |

|--24.73%--lldb_private::process_gdb_remote::ProcessGDBRemote::RefreshStateAfterStop()
                          |          |
                          |          |--19.68%--lldb_private::Process::UpdateThreadListIfNeeded()
                          |          |          |
                          |          |          |--11.50%--lldb_private::ThreadList::Update(lldb_private::ThreadList&)
                          |          |          |          |
                          |          |          |           --0.89%--lldb_private::Thread::GetBackingThread() const
                          |          |          |
                          |          |          |--3.08%--lldb_private::ThreadPlanStackMap::Update(lldb_private::ThreadList&, bool, bool)
                          |          |          |          |
                          |          |          |           --2.95%--lldb_private::ThreadList::FindThreadByID(unsigned long, bool)
                          |          |          |
                          |          |          |--2.76%--lldb_private::Thread::GetBackingThread() const


On Nov 2, 2023, at 11:01 AM, Jeffreytan81 via LLVM Discussion Forums notifications@llvm.discoursemail.com wrote:

> Thank you for the insights. This is incredibly helpful!
>
> This is my first time delving into the ThreadPlan logic, and I’ve only done a brief read on single-thread stepping. Regarding the performance bottleneck, the biggest challenge is determining whether it’s latency-bound or CPU-bound. Based on my investigation of a step operation taking 8-10 seconds, there are 8 internal/private stops, 3 of which stop the other threads, while the other 5 resume all threads.
>
> I’ve profiled LLDB during the stepping process using the Linux perf tool, which attributes roughly 75% of CPU time to a couple of hot paths. The most critical path is ProcessGDBRemote::GetThreadStopInfoFromJSON(), called from ProcessGDBRemote::CalculateThreadStopInfo(). We enumerate each thread and search m_jstopinfo_sp for its stop information by TID; when stopping all threads the m_jstopinfo_sp JSON can be quite large, so this per-thread linear search is potentially O(N^2) overall. There’s a similar issue in ThreadList::GetBackingThread(). I’ve made a quick prototype using a hash table to map <TID => stop_info>, and it improves stepping performance by around 10-20%, although that’s not as significant as single-thread stepping. I’ve also noticed various JSON parsing operations in ProcessGDBRemote::SetThreadStopInfo(). It seems that the size of jstopinfo is much larger when stopping all threads.

That was supposed to be “stepping all threads” right?

Ingesting the thread list is the main piece of work here, so it’s not surprising that’s what is showing up hot. And that there are some places where we didn’t use the most efficient algorithm because we never had enough threads to notice isn’t terribly surprising. Fixing those is pure goodness!

If your program’s thread population is growing or fluctuating, rather than being at a steady state, then you will indeed see different numbers of threads reported when you step only one of the non-thread-creating threads than when you let all threads run in the same scenario, because in the latter case the thread-creating threads got a chance to run. And it will also increase the latency by the amount of time those other threads get to run.

But that’s just postponing the inevitable: that work is going to get done at some point, and the threads will eventually get created. Plus, your speedup comes at the expense of altering program behavior, which we try to avoid.

Other than that, I can’t think of any plausible reason why the thread numbers would vary between single thread running and all threads running. At present lldb-server’s contract is that it report all threads to us. It would be surprising if it was not doing that currently, and even weirder if that was gated on whether the last continue ran only one thread or not.

> Another possibility is that the slowness is latency-bound, as you’ve mentioned. Resuming/pausing 3000+ threads and synchronously waiting for them can indeed take some time.
>
> I’ve also experimented with setting target.process.experimental.os-plugin-reports-all-threads to false, hoping it would reduce jstopinfo, but I haven’t seen any performance improvement from it. Currently, it’s not entirely clear to me what is causing the slowness, but if you ask me, it’s most likely the latency issue mentioned above.

That setting just means lldb is READY to handle threads not being reported on every stop. But I very much doubt that lldb-server ever does that, so you wouldn’t expect the setting to make any difference without a comparable change in lldb-server. If anything, enabling this forgetful mode while the stub is still reporting all threads will make things a little worse: since we can’t tell which threads no longer exist, we transfer them to a list of “pended” threads, and every time we see a new TID we have to check whether it’s in that pended list before making a new Thread object to represent it. We do prune the pended threads when you do a full “thread list”, but I bet with 3000 threads you don’t do that very often.

We could probably make this a little nicer by fetching the full thread list right before we return control to the user. Since most steps through real code have 5-10 internal stops, that would still be a decent speedup, w/o causing us to carry around so many pended threads.

> The second and third responses are quite intriguing. I might explore some of them if this turns out to be a high-priority issue. I had the same question in my mind while examining the profile trace—whether we really need to update the stop info during private/internal stops. However, I haven’t delved deep enough to propose any solutions. It’s great that you and Pavel have already explored this to some extent. I wonder if, to reduce the size of jstopinfo, setting target.process.experimental.os-plugin-reports-all-threads to false is sufficient or if further optimizations are needed?

All of the logic that we do to prepare for the next time the ThreadPlans continue the process is based on thread stop infos, and the majority of those continues are from private stops. So no, we can’t do our job w/o knowing the stop info of all the threads we have to reason about at every stop. Note, if a thread has a trivial stop reason, we just skip over it in the “what to do next” negotiations; that’s why not telling us about those threads doesn’t change the stepping decisions. But if you do tell us about a thread, we will need to know its stop reason to know what to do with it.

As I said above, the setting requires an equivalent change in lldb-server (to not report threads with no stop reason) for it to make a positive difference.

That bit of the profile isn’t terribly surprising…

Hope this helps,

Jim

