Lldb-server significant slowdown with 3000 threads

Hi,

We are noticing a significant performance issue while using lldb to debug an internal service. It turns out the issue is caused by the service using multiple thread pools, creating 4000~5000 threads (yeah, pretty bad :frowning: ). Benchmarking with and without a debugger shows that attaching lldb adds around 6~7 minutes of slowdown, while gdb incurs very minimal slowdown.

Further digging shows that the bottleneck is the following code in NativeProcessLinux::SigchldHandler, which calls waitpid() once per known thread. Since the handler runs for every thread event, the total work grows quadratically with the number of threads:

  bool checked_main_thread = false;
  for (const auto &thread_up : m_threads) {
    if (thread_up->GetID() == GetID())
      checked_main_thread = true;

    if (std::optional<WaitStatus> status = HandlePid(thread_up->GetID()))
      tid_events.try_emplace(thread_up->GetID(), *status);
  }

You can reproduce the slowdown with the code below:
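(The original snippet is not reproduced here; a minimal sketch along the following lines, which keeps a few thousand threads alive at the same time, shows the same behavior.)

#include <chrono>
#include <thread>
#include <vector>

int main() {
  // Keep a few thousand threads alive at once, so the debuggee's thread
  // list is large while thread creation/exit events keep arriving.
  std::vector<std::thread> threads;
  threads.reserve(4000);
  for (int i = 0; i < 4000; ++i)
    threads.emplace_back(
        [] { std::this_thread::sleep_for(std::chrono::seconds(1)); });
  for (auto &t : threads)
    t.join();
}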

Without a debugger it finishes in under 2 seconds, while with the debugger attached it takes more than 10 minutes.

This behavior seems to have been introduced in D116372 [lldb-server/linux] Fix waitpid for multithreaded forks. After reverting that change, the lldb wall time drops from 10 minutes to under 3 seconds.

@labath, what do you think is the best way to solve this? Thanks. cc @clayborg

Jeffrey

Hello Jeffrey,

thanks for bringing this to my attention. As I mentioned in D116372 [lldb-server/linux] Fix waitpid for multithreaded forks, there are basically two ways to solve the problem it addresses (multiple NativeProcess instances stealing waitpid events from one another). The patch implemented the second one because it made the code cleaner. I did not expect it to have this much of a performance impact, though in retrospect, I probably should have.

I think this means we need to implement the first option instead. There is nothing fundamentally hard there; it just requires redesigning some of the interfaces around the NativeProcess class so that we can listen for waitpid events centrally and then dispatch them to the appropriate process. The main thing which makes that complicated is that the notifications for clone child threads can’t be associated with a process without the corresponding notification on the parent thread (and the two can come in any order). This means one has to have some sort of central repository of threads that get assigned to a specific process once their parent is known. Nothing impossible – just work.
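To make that concrete, here is a rough sketch of the idea (the names WaitpidMonitor, AssignThread, and HandleThreadEvent are hypothetical, not lldb's actual interfaces): one monitor reaps every waitpid() event and parks events for threads whose owning process is not yet known, instead of each process polling all of its threads.

#include <sys/wait.h>

#include <map>
#include <unordered_map>

struct Process {
  // Stand-in for NativeProcessLinux; the real code would update the
  // corresponding NativeThread and report the state change.
  void HandleThreadEvent(pid_t tid, int status) { (void)tid; (void)status; }
};

class WaitpidMonitor {
public:
  // Called from the SIGCHLD handler: reap everything that is pending.
  void DrainEvents() {
    int status;
    pid_t tid;
    while ((tid = waitpid(-1, &status, __WALL | WNOHANG)) > 0) {
      auto it = m_tid_owners.find(tid);
      if (it != m_tid_owners.end())
        it->second->HandleThreadEvent(tid, status);
      else
        m_unassigned[tid] = status; // parent's clone notification not seen yet
    }
  }

  // Called once the PTRACE_EVENT_CLONE on the parent thread tells us which
  // process owns the new thread; delivers any event that arrived early.
  void AssignThread(pid_t tid, Process *owner) {
    m_tid_owners[tid] = owner;
    auto it = m_unassigned.find(tid);
    if (it != m_unassigned.end()) {
      owner->HandleThreadEvent(tid, it->second);
      m_unassigned.erase(it);
    }
  }

private:
  std::unordered_map<pid_t, Process *> m_tid_owners; // tid -> owning process
  std::map<pid_t, int> m_unassigned; // events waiting for an owner
};

The m_unassigned map is what deals with the ordering problem: a clone child's event can arrive before the parent's notification, and it simply waits there until AssignThread is called for it.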

I think I can take a stab at this next week, but if you want, you can try to implement something sooner.


@labath, makes sense. I am not an expert in ptrace or the code around this, so you would be a better person to fix it. Let me know how it goes.

Thanks
Jeffrey

Hello Jeffrey,

please check out D146977. It actually turned out easier than I expected since I could reuse the existing Factory class as the “central thread repository”. I haven’t tested the performance, but I would expect it to be roughly on par with the pre-D116372 world.

(Technically the algorithm is still quadratic, but now the only quadratic part is the iteration through the NativeProcessProtocol thread list. We could also fix that by using a different data structure to store the threads.)
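For illustration, a sketch of what that could look like (hypothetical, and not part of D146977): keying the thread list by tid, so that finding the thread for an event is a map lookup rather than a linear scan.

#include <cstdint>
#include <map>
#include <memory>

using tid_t = uint64_t; // stand-in for lldb::tid_t

struct NativeThread { // stand-in for NativeThreadProtocol
  explicit NativeThread(tid_t tid) : m_tid(tid) {}
  tid_t GetID() const { return m_tid; }
  tid_t m_tid;
};

struct ThreadList {
  // Instead of std::vector<std::unique_ptr<NativeThread>> m_threads:
  std::map<tid_t, std::unique_ptr<NativeThread>> m_threads;

  // O(log n) lookup instead of scanning the whole list per event.
  NativeThread *GetThreadByID(tid_t tid) {
    auto it = m_threads.find(tid);
    return it == m_threads.end() ? nullptr : it->second.get();
  }
};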
