Lack of parallelism

I’ve been trying to improve the parallelism of lldb but have run into an odd roadblock. I have the code at the point where it creates 40 worker threads, and it stays that way because there is enough work to keep them busy. However, running ‘top -d 1’ shows that for the time in question, CPU load never gets above 4-8 CPUs (even though I have 40).

  1. I tried mutrace, which measures mutex contention (I had to call unsetenv("LD_PRELOAD") in main() so it wouldn’t propagate to the process being tested; a sketch of that workaround follows this list). It indicated some minor contention, but not enough to be the problem. Regardless, I converted everything I could to lock-free structures (TaskPool and ConstString) and it didn’t help.

  2. I tried strace, but I don’t think strace can figure out how to trace lldb. It says it waits on a single futex for 8 seconds, and then is done.
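
(For reference, the LD_PRELOAD workaround mentioned in point 1 looks roughly like this; mutrace injects itself via LD_PRELOAD, so the only point of the sketch is the unsetenv call at the top of main, and the rest of the driver is omitted.)

  #include <cstdlib>

  int main(int argc, char **argv) {
    // mutrace works by LD_PRELOAD'ing its interposer library; clearing the
    // variable here keeps it from being inherited by the inferior that lldb
    // launches, so only lldb itself gets traced.
    unsetenv("LD_PRELOAD");
    // ... the rest of the driver's main() is unchanged ...
    return 0;
  }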

I’m about to try lttng to trace all syscalls, but I was wondering if anyone else had any ideas? At one point I wondered if it was contention on the kernel’s mmap semaphore, but that shouldn’t affect faulting in individual pages, and I assume lldb doesn’t call mmap all the time.

I’m getting a bit frustrated because lldb should be taking 1-2 seconds to start up (it has ~45s of user+system work to do, which spread across 40 cores is only a second or so), but instead it takes 8-10 seconds, and I’ve been stuck there for a while.

If you have access to a Windows machine and you can reproduce the slowdown there, there are surprisingly good tools available for diagnosing parallelism and thread contention.

https://github.com/google/UIforETW

I'm not sure about Linux, but on OS X lldb will mmap the debug information rather than using straight reads. That should only happen once per loaded module, though.

Jim

As it turns out, it was lock contention in the memory allocator. Using tcmalloc brought it from 8+ seconds down to 4.2.

I think this didn’t show up in mutrace because glibc’s malloc doesn’t use pthread mutexes.

Greg, that joke about adding tcmalloc wholesale is looking less funny and more serious… Or maybe it’s enough to make it a CMake link option (use it if present, or only when requested).

The other thing would be to try to move the demangler to use a custom allocator everywhere. I'm not sure which demangler you are using for these tests, but we can either use the native system one from <cxxabi.h> or the fast demangler in FastDemangle.cpp. If it is the latter, then we can probably optimize this.
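
For context, a minimal sketch of the <cxxabi.h> path mentioned above; the mangled name is just an example input, and the point is that __cxa_demangle hands back a malloc'ed buffer, i.e. exactly the allocator traffic this thread is about:

  #include <cxxabi.h>
  #include <cstdio>
  #include <cstdlib>

  int main() {
    const char *Mangled = "_ZNSt6vectorIiSaIiEE9push_backERKi"; // example name
    int Status = 0;
    // __cxa_demangle allocates the result with malloc (unless we hand it a
    // buffer), so heavy demangling turns directly into allocator traffic.
    char *Demangled = abi::__cxa_demangle(Mangled, nullptr, nullptr, &Status);
    if (Status == 0 && Demangled)
      printf("%s\n", Demangled);
    free(Demangled);
    return 0;
  }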

The other thing to note is that local files will be mmap’ed in, and paging doesn’t really show up well in perf tests: it just looks like system time while the kernel faults in pages from the symbol files as they are read. You could try disabling the mmap path in DataBufferLLVM.cpp and see if you notice any difference. The call to llvm::MemoryBuffer::getFileSlice() takes an IsVolatile flag as its last argument. If you set it to true, we will read the file into memory instead of mmap’ing it, which will at least show whether any component of the time is due to mmap’ing. Currently we check whether the file is local (not on a network mount), and if it is local we mmap it.
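
Something like the following, assuming the current getFileSlice() signature (filename, map size, offset, IsVolatile); the helper name and call site are made up for illustration:

  #include "llvm/Support/MemoryBuffer.h"

  // Hypothetical helper: force a read instead of an mmap by passing
  // IsVolatile = true, so the cost of touching the data shows up as ordinary
  // read time rather than page faults spread over the rest of the run.
  static std::unique_ptr<llvm::MemoryBuffer>
  ReadSliceNoMmap(const llvm::Twine &Path, uint64_t Size, uint64_t Offset) {
    auto BufferOrErr = llvm::MemoryBuffer::getFileSlice(
        Path, Size, Offset, /*IsVolatile=*/true);
    if (!BufferOrErr)
      return nullptr;
    return std::move(*BufferOrErr);
  }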

Greg

I'm using the demangler I modified here: https://reviews.llvm.org/D32500
I think it still starts with FastDemangle.cpp, but one test showed the modified llvm demangler is almost as fast (~1.25% slowdown from disabling FastDemangle). I might be able to narrow that further by putting the initial arena on the stack.
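
For what it's worth, the "initial arena on the stack" idea would look something like this hypothetical sketch (names and sizes are invented): a bump allocator whose first block is a stack buffer, so small demangle jobs never reach malloc at all.

  #include <cstddef>
  #include <cstdlib>

  class StackSeededArena {
    char InitialBlock[4096];                    // first block lives on the stack
    char *Cur = InitialBlock;
    char *End = InitialBlock + sizeof(InitialBlock);

  public:
    void *Allocate(size_t Size) {
      Size = (Size + 15) & ~size_t(15);         // keep allocations 16-byte aligned
      if (Cur + Size <= End) {                  // fast path: bump the pointer
        void *P = Cur;
        Cur += Size;
        return P;
      }
      // Overflow: fall back to the heap. A real arena would chain new blocks
      // and release them all in the destructor.
      return malloc(Size);
    }
  };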

Now that I've moved past the parallelism bottleneck, I think I need to revisit my changes to make sure they're having the desired effect.