Help fixing deadlock in DWARF symbol preloading

Hi,

I’ve found a deadlock in lldb (see attached test case, you can build it with just clang -o test test.s), but I’m a total newbie and I have no idea what’s the right way to fix it.

The problem happens when an error is found during DIE extraction when preloading symbols. As far as I can tell, it goes like this:

  1. Module::PreloadSymbols locks Module::m_mutex
  2. A few layers below it, we end up in ManualDWARFIndex::Index, which dispatches DIE extractions to a thread pool:
  for (size_t i = 0; i < units_to_index.size(); ++i)
    pool.async(extract_fn, i);
  pool.wait();

  1. extract_fn in the snippet above ends up executing DWARFDebugInfoEntry::Extract and when there’s an error during extraction, Module::GetDescription is called while generating the error message.
  2. Module::GetDescription tries to acquire Module::m_mutex from a different thread, while the main thread has the mutex already locked and it’s waiting for DIE extraction to end, causing a deadlock.

If we make Module::GetDescription not lock the problem disappears, so the diagnosis looks correct, but I don’t know what would be the right way to fix it. Module::GetDescription looks more or less safe to call without locking: it just prints m_arch, m_file, and m_object_name to a string, and those look like fields that wouldn’t change after the Module is initialized, so maybe it’s okay? But I feel like there must be a better solution anyway. Any advice?

Best,
Jorge

test.s (4.82 KB)

Please have a look at <https://lists.llvm.org/pipermail/lldb-dev/2020-October/016479.html&gt;, which is the last time this came up.

One quick'n'dirty solution would be to have `Module::ReportError` _try_ to get the module lock, and if it fails, just bail out. That obviously means you won't get to see the error message which triggerred the deadlock (though we could also play around with that and try printing the error message without the module description or something), but it will at least get you past that point...

pl

Wouldn’t it be preferable to try_lock in GetDescription (which is the one currently acquiring the mutex) instead? ReportError doesn’t touch any mutex itself and will happily report the rest of the error if GetDescription bails out. For the test case I sent it would look like this:

error: {0x0000000b}: invalid abbreviation code 123, please file a bug and attach the file at the start of this error message
error: {0x0000000b}: invalid abbreviation code 123, please file a bug and attach the file at the start of this error message

error: {0x0000000b}: invalid abbreviation code 123, please file a bug and attach the file at the start of this error message

which is way better than a deadlock IMO.

We could also just give the Module a std::string with the description
and update it in the few places where we actually update it. The
m_arch already has a setter in place that just needs to be used in a
few more places, so the infrastructure is kind of already there (at
least for m_arch). The description would just have its own mutex.

I just started looking at how to do this (having a separate mutex for the description) and I think I found another bug. Or maybe I’m missing some assumption.

On one hand, Module::SetArchitecture won’t assign the new value if m_arch is already valid, just return m_arch.IsCompatibleWith(new_arch). On the other hand, in Module::MergeArchitecture we have both code that assumes that SetArchitecture replaces the previous value:

if (m_arch.IsCompatibleMatch(arch_spec)) {

// The new architecture is different, we just need to replace it.
return SetArchitecture(arch_spec);
}

and right after it we have code that works around the fact that SetArchitecture doesn’t replace the previous value sometimes

// Merge bits from arch_spec into “merged_arch” and set our architecture.
ArchSpec merged_arch(m_arch);
merged_arch.MergeFrom(arch_spec);
// SetArchitecture() is a no-op if m_arch is already valid.
m_arch = ArchSpec();
return SetArchitecture(merged_arch);

My guess is that the first of the snippets above has a bug (it intends to replace the architecture but doesn’t) and that replacing a module’s architecture with an incompatible one is something that doesn’t happen often and the bug went unnoticed. Does this make sense?

A simpler solution is to remove the lock in Module::GetDescription(…) all of the data that is used to dump the module description is set correctly (the file, the arch and the object name for BSD archives)) very early on when a module is created before anyone should be making symbol or debug info queries or parsing.

I'd be uncomfortable with a function like GetDescription "randomly" doing nothing (returning an empty string, or whatever). OTOH, while it's definitely not ideal, I think I could live with ReportError not reporting anything (or skipping the description, or something) under some circumstances.

That said, I like both Greg's and Raphael's ideas on how to fix this (assuming they work)...

pl