On Nov 2, 2023, at 11:01 AM, Jeffreytan81 via LLVM Discussion Forums firstname.lastname@example.org wrote:
Thank you for the insights. This is incredibly helpful!
This is my first time delving into the ThreadPlan logic, and I’ve only done a brief read on single-thread stepping. Regarding the performance bottleneck, the biggest challenge lies in determining whether it’s latency or CPU-bound. Based on my investigation of a step operation taking 8-10 seconds, it appears that there are 8 internal/private stops, 3 of which involve stopping other threads, while the other 5 require resuming all threads.
I’ve profiled LLDB during the stepping process using the Linux perf tool, which attributes roughly 75% of CPU time to a handful of hot paths. The most critical path appears to be ProcessGDBRemote::GetThreadStopInfoFromJSON(), called from ProcessGDBRemote::CalculateThreadStopInfo(). We enumerate every thread and, for each one, linearly search m_jstopinfo_sp for its stop information by TID. When stopping all threads, the m_jstopinfo_sp JSON can become quite large, so the overall lookup cost is potentially O(N^2). There’s a similar issue in ThreadList::GetBackingThread(). I’ve made a quick prototype using a hash table to map <TID => stop_info>, and it seems to improve stepping performance by around 10-20%, although that’s not as significant as single-thread stepping. I’ve also noticed various JSON parsing operations in ProcessGDBRemote::SetThreadStopInfo(); jstopinfo is much larger when stopping all threads.
That was supposed to be “stepping all threads”, right?
Ingesting the thread list is the main piece of work here, so it’s not surprising that’s what is showing up hot. It’s also not terribly surprising that there are places where we didn’t use the most efficient algorithm, simply because we never had enough threads for it to matter. Fixing those is pure goodness!
If your program’s thread population is growing or fluctuating rather than at a steady state, then you will indeed see different numbers of threads reported when you step only one of the non-thread-creating threads, as opposed to letting all threads run in the same scenario, because the thread-creating threads got a chance to run. It will also increase the latency by the amount of time those other threads get to run.
But that’s just postponing the inevitable: that work is going to get done at some point, and the threads will eventually get created. Plus, your speedup comes at the expense of altering program behavior, which we try to avoid.
Other than that, I can’t think of any plausible reason why the thread numbers would vary between single-thread running and all-threads running. At present, lldb-server’s contract is that it reports all threads to us. It would be surprising if it were not doing that currently, and even weirder if that were gated on whether the last continue ran only one thread or not.
Another possibility is that the slowness is latency-bound, as you’ve mentioned. Resuming/pausing 3000+ threads and synchronously waiting for them can indeed take some time.
I’ve also experimented with setting target.process.experimental.os-plugin-reports-all-threads to false, hoping it would reduce jstopinfo, but I haven’t seen any performance improvement with this setting. Currently, it’s not entirely clear to me what is causing the slowness, but if you ask me, it’s most likely the latency issue mentioned above.
That setting just means lldb is READY to handle threads not being reported on all stops. But I very much doubt that lldb-server ever does that, so you wouldn’t expect this to make any difference without a comparable change in lldb-server. If anything, turning the setting on when the stub is always reporting all threads will make things worse: since we can’t retire threads that no longer exist, we instead transfer them to a list of “pended” threads, and every time we see a new TID we first have to check whether it’s in our list of pended threads before making a new Thread object to represent it. We will prune the pended threads when you do a full “thread list”, but I bet if you have 3000 threads you don’t do that very often.
We could probably make this a little nicer by fetching the full thread list right before we return control to the user. Since most steps through real code have 5-10 internal stops, that would still be a decent speedup, w/o causing us to carry around so many pended threads.
The second and third responses are quite intriguing. I might explore some of them if this turns out to be a high-priority issue. I had the same question in mind while examining the profile trace, namely whether we really need to update the stop info during private/internal stops. However, I haven’t delved deep enough to propose any solutions. It’s great that you and Pavel have already explored this to some extent. I wonder if setting target.process.experimental.os-plugin-reports-all-threads to false is sufficient to reduce the size, or if further optimizations are needed?
All of the logic we run to prepare for the next time the ThreadPlans continue the process is based on thread stop infos, and the majority of those continues are from private stops. So no, we can’t do our job w/o knowing the stop info of all the threads we have to reason about at every stop. Note that if a thread has a trivial stop reason, we just skip over it in the “what to do next” negotiations; that’s why not telling us about those threads doesn’t change the stepping decisions. But if you do tell us about a thread, we will need to know its stop reason to know what to do with it.
As I said above, the setting requires an equivalent change in lldb-server (to not report threads with no stop reason) for it to make a positive difference.
That bit of the profile isn’t terribly surprising…
Hope this helps,
Snippet of the profile trace:
| --48.63%--lldb_private::process_gdb_remote::ProcessGDBRemote::GetThreadStopInfoFromJSON(lldb_private::process_gdb_remote::ThreadGDBRemote*, std::shared_ptr<lldb_private::StructuredData::Object> const&)
| | |
| | --12.15%--lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(unsigned long, std::map<unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, unsigned char, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long, bool, lldb_private::LazyBool, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, lldb::QueueKind, unsigned long)
| | |
| | |--5.16%--lldb_private::ThreadList::GetBackingThread(std::shared_ptr<lldb_private::Thread> const&)
| | |
| | |--2.94%--lldb_private::ThreadList::FindThreadByProtocolID(unsigned long, bool)
| | |
| | |--11.50%--lldb_private::ThreadList::Update(lldb_private::ThreadList&)
| | | |
| | | --0.89%--lldb_private::Thread::GetBackingThread() const
| | |
| | |--3.08%--lldb_private::ThreadPlanStackMap::Update(lldb_private::ThreadList&, bool, bool)
| | | |
| | | --2.95%--lldb_private::ThreadList::FindThreadByID(unsigned long, bool)
| | |
| | |--2.76%--lldb_private::Thread::GetBackingThread() const