[RFC] Improve DWARF 5 .debug_names Type Lookup/Parsing Speed

Thank you for sharing the proposal, but I’m not convinced that it would provide a significant performance improvement.

For B::C, it would require querying all entries of B and then all entries of C to cross-check, resulting in an O(num_B * num_C) complexity. This process would need to be done recursively for A vs. B and across the parent chain. I’m not sure if this approach would be any better than building a reverse mapping once. In any case, both methods seem to be much slower than the O(num_C) approach of DW_IDX_type_hash.

I think the bigger problem of template / simple / canonical names remains.

In the hot path we encountered, the declaration context comes from compiler-generated DWARF, not from human input (I’ve attached the performance trace of the hot path at the end). So it involves comparing the compiler’s declaration-context output with the compiler’s output in .debug_names, and there is no concern about canonical names in this context. I’m not sure there is a valid scenario in which human input triggers the hot path.

|                      --89.09%--SymbolFileDWARF::ResolveTypeUID(DWARFDIE const&, bool)
|                                SymbolFileDWARF::ResolveType(DWARFDIE const&, bool, bool)
|                                SymbolFileDWARF::GetTypeForDIE(DWARFDIE const&, bool)
|                                SymbolFileDWARF::ParseType(lldb_private::SymbolContext const&, DWARFDIE const&, bool*)
|                                DWARFASTParserClang::ParseTypeFromDWARF(lldb_private::SymbolContext const&, DWARFDIE const&, bool*)
|                                DWARFASTParserClang::ParseStructureLikeDIE(lldb_private::SymbolContext const&, DWARFDIE const&, ParsedDWARFTypeAttributes&)
|                                SymbolFileDWARFDwo::FindDefinitionTypeForDWARFDeclContext(DWARFDIE const&)
|                                SymbolFileDWARF::FindDefinitionTypeForDWARFDeclContext(DWARFDIE const&)
|                                lldb_private::DebugNamesDWARFIndex::GetTypes(DWARFDeclContext const&, llvm::function_ref<bool (DWARFDIE)>)
|                                |
|                                 --89.03%--lldb_private::DebugNamesDWARFIndex::ProcessEntry(llvm::DWARFDebugNames::Entry const&, llvm::function_ref<bool (DWARFDIE)>)
|                                           |
|                                           |--86.21%--SymbolFileDWARF::GetDIE(DIERef const&)
|                                           |          |
|                                           |           --86.20%--DWARFDebugInfo::GetDIE(DIERef const&)
|                                           |                     DWARFUnit::GetDIE(unsigned long)
|                                           |                     |
|                                           |                      --86.20%--DWARFUnit::ExtractDIEsIfNeeded()
|                                           |                                |
|                                           |                                |--85.39%--DWARFUnit::ExtractDIEsRWLocked()

I think you are right that type hashes are probably simpler and faster, but it’s worth pointing out that the parent alternative is not that bad (see below). Besides these, do we gain/lose anything by going with either approach? AFAICT they both suffer from the same drawbacks of canonical names, but is there anything that one implementation enables which the other does not?

This is not quite accurate. For B::C, we’d query the table for B and build a set B_set of the "B" entries, and similarly a set C_set for C. For each entry in C_set, we’d look up its parent P in B_set. The complexity of this is num_C queries to a set plus two queries to the accelerator table.

More generally, if we have N namespaces ending in ::C, we are looking at a worst case scenario of N queries to the accelerator table plus N * (N-1) * num_C set queries (consider the case in which the last namespace is always a mismatch).
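In rough C++, the cross-check would look something like the sketch below. This is only an illustration of the set-based idea; Entry is a hypothetical stand-in for an accelerator table entry, not the actual LLVM type.

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    struct Entry {
      uint64_t die_offset;
      uint64_t parent_die_offset; // 0 if the entry has no parent
    };

    // b_entries / c_entries are the results of the two accelerator table
    // queries (for "B" and for "C"). Returns the "C" entries whose parent is
    // one of the "B" entries.
    std::vector<Entry> CrossCheck(const std::vector<Entry> &b_entries,
                                  const std::vector<Entry> &c_entries) {
      std::unordered_set<uint64_t> b_set;
      for (const Entry &b : b_entries)
        b_set.insert(b.die_offset);

      std::vector<Entry> matches;
      for (const Entry &c : c_entries) // num_C set lookups, not num_B * num_C comparisons
        if (b_set.count(c.parent_die_offset))
          matches.push_back(c);
      return matches;
    }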

(I’ve attached the performance trace of the hot path at the end).

Thank you for sharing; this is pretty much the same path that I have encountered too.

In the hot path we encountered, the declaration context comes from compiler-generated DWARF, not from human input

This is fine, but we need to make sure we are not regressing any other cases when we change the underlying way of finding types.

Looking into this a bit more:

  1. I think we are only doing these queries when using declarations from DWARF to find definitions in DWARF. So we should be safe?

  2. We would not be regressing any existing behavior anyway: the current implementation compares AT_names with AT_names. If they were different today, they would still be different with the hash/parent approach. So nothing changes either?

Ah, OK - so that helps me understand a bit what this is about.

I still think this is not ideal - it’d create a problem for mixed binaries (GCC+Clang builds, or even some cases of Clang+Clang with version skew - sometimes we change the name printing, like whitespace, parentheses, etc., in C++ template parameters). Some parts of DWARF aren’t canonicalized either, like lambdas that include source file/line number, which is neither necessary nor sufficient to uniquely identify a lambda (admittedly there’s no current answer for lambdas - we probably have to use the mangled name there). And simple template names, perhaps.

The parent chain examining option seems like the most DWARF-ish way to do this owing to the possible variation in the entity names. I guess we could include hashes of the /mangled/ name (in cases where the entity has a mangled name - which is true for anything you’d want to search across CUs for) as that’d be reliable/portable.

Though including the mangled names of types would increase DWARF size - only including their hashes could be more feasible/cheaper.

I still think this is not ideal - it’d create a problem for mixed binaries

I’m not sure I follow: this problem already exists today, and we would not be making it any worse, right? We are improving the arguably more common case of single-compiler binaries while not changing the status quo for mixed-compiler binaries.

I guess we could include hashes of the /mangled/ name

This doesn’t really help the case described in the original post. LLDB queries the accelerator tables with input A::B::C (a list of strings). I believe we can’t derive the mangled name from such input: if we could, we would have just queried the accelerator table for the mangled name and the entire problem would be defined away. Does that make sense?

This conversation seems to have fizzled out.

I’m trying to make sense of the last couple of comments: Does “the parent chain examining option” mean implementing DW_IDX_parent in the accelerator table?

I believe that’s what @dblaikie meant.

David, do you have any thoughts regarding my last comment?

Let me try to summarize this discussion in order to move forward.

Consider the problem of finding the DIE for some type, where the input to the query is a fully qualified name like A::B::C. The need for such a query can arise in multiple scenarios, but the one with a performance bottleneck is when LLDB needs to evaluate an expression (more on this below).

The current status quo is that LLDB queries the accelerator table for C, producing a DIE that may or may not correspond to A::B::C. Today, LLDB verifies this by parsing the entire compilation unit of that DIE and checking the names of its parents. This is expensive.

When the parents don’t match, the search is repeated for every compilation unit. Mismatches are very common (especially when dealing with the standard library) and therefore expensive for big projects, where LLDB parses thousands of compilation units before finding a match.

(In the algorithm above, the spelling of template names is problematic both for the initial accelerator table query for C and when checking the parents of the DIE found through the query. The spelling problem can involve issues that are not consistent across compilers, like extra spaces, or broader issues like hidden default template arguments, e.g. vector<int> is actually vector<int, std::__1::allocator<int>>.)
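To make the status-quo flow concrete, here is a rough sketch. The DIE type and the helpers are illustrative stand-ins for the LLDB code paths shown in the trace earlier in the thread, not the actual API; the expensive part is the stand-in for DWARFUnit::ExtractDIEsIfNeeded, which materializes every DIE of the candidate’s CU just so the parent chain can be walked.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct DIE {
      std::string name;
      const DIE *parent = nullptr;
    };

    // Stand-in for the expensive step: extracts every DIE of the candidate's
    // CU before the candidate's parents can be inspected.
    const DIE *ExtractCUAndGetDIE(uint64_t die_offset);

    // Checks whether the candidate's parents are named {"A", "B"} (outermost first).
    bool MatchesDeclContext(uint64_t candidate_offset,
                            const std::vector<std::string> &parents) {
      const DIE *die = ExtractCUAndGetDIE(candidate_offset)->parent;
      for (auto it = parents.rbegin(); it != parents.rend(); ++it) {
        if (!die || die->name != *it)
          return false;
        die = die->parent;
      }
      return true;
    }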

There are two standardized accelerator table entry attributes that can speed up such queries, neither of which is implemented today.

1- DW_IDX_type_hash

This is an attribute containing the “type hash” of the type represented by this accelerator table entry.

Greg pointed out that this could be similar to what is done in Apple tables, where the accelerator table entry would also contain a hash of the fully qualified name (A::B::C). This would provide a quick way to filter out the results of a query for C in the table.

David points out that, according to the spec, this is not supposed to be a hash of the fully qualified name. Instead, the algorithm described in Section 7.32 is supposed to be used. This attribute is meant to allow linkers to deduplicate type declarations across CUs.

The spec allows custom attributes through the use of an augmentation string in the header of the accelerator table, providing a path to the fully-qualified-name idea.
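To illustrate the fully-qualified-name filtering idea, here is a sketch under stated assumptions: the qualified_name_hash field is a hypothetical custom attribute (not something producers emit today), and this is the Apple-tables-style filtering Greg describes, not the Section 7.32 type hash.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Entry {
      uint64_t die_offset;
      uint32_t qualified_name_hash; // hypothetical custom IDX attribute
    };

    // Bernstein (DJB) hash, the same function .debug_names already uses for
    // its hash table buckets.
    uint32_t HashDJB(const std::string &s) {
      uint32_t h = 5381;
      for (unsigned char c : s)
        h = h * 33 + c;
      return h;
    }

    // c_entries: the results of the accelerator table query for the basename "C".
    std::vector<Entry> FilterByQualifiedName(const std::vector<Entry> &c_entries,
                                             const std::string &qualified_name) {
      const uint32_t want = HashDJB(qualified_name); // e.g. "A::B::C"
      std::vector<Entry> hits;
      for (const Entry &e : c_entries)
        if (e.qualified_name_hash == want) // cheap filter, no DIE parsing needed
          hits.push_back(e);
      return hits;
    }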

2- DW_IDX_parent

Given some accelerator table entry X, this is an attribute containing a pointer to the parent of X; more specifically, a pointer to the accelerator table entry for the parent of X.

Due to design subtleties, such a pointer does not allow LLDB to inspect the name of the parent of X. This is not a big problem: LLDB can either use the algorithm described in (1) or parse the DIE of the parent (not the entire CU!) and check its AT_name.

There were concerns that the parent may not exist, or that its name may be empty and therefore it won’t show up in the accelerator table. This should not be a problem: if LLDB is looking for something that should have a parent, and the entry X found in the accelerator table does not have a parent, we know that X is not what we are looking for.
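A sketch of how a consumer could use this attribute: follow the parent links between index entries and parse only each parent DIE’s DW_AT_name, never the whole CU. The types and the name-reading callback below are illustrative stand-ins, not the real LLVM/LLDB structures.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <string>
    #include <vector>

    struct IndexEntry {
      uint64_t die_offset;
      std::optional<size_t> parent; // position of the parent's entry, if any
    };

    // read_die_name stands in for "parse just this one DIE and return DW_AT_name".
    bool ParentChainMatches(const std::vector<IndexEntry> &entries, size_t start,
                            const std::vector<std::string> &parents, // e.g. {"A", "B"}
                            const std::function<std::string(uint64_t)> &read_die_name) {
      std::optional<size_t> cur = entries[start].parent;
      // Walk outward: innermost expected parent ("B") first, then "A".
      for (auto it = parents.rbegin(); it != parents.rend(); ++it) {
        if (!cur) // we expected a parent but the entry has none: not a match
          return false;
        if (read_die_name(entries[*cur].die_offset) != *it)
          return false;
        cur = entries[*cur].parent;
      }
      return true;
    }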


Let’s go back to the spelling problem.

If a user is the one writing the type name, there are lots of problems to solve, and it doesn’t seem like there are too many options here, or strong motivation to address this problem right now.

In the expression evaluator context, where there is a performance bottleneck, spelling should not be an issue: LLDB uses declarations from DWARF to find definitions in DWARF. To quote Greg:

During type lookups for the expression parser, we often know the decl context that must contain a type like “iterator”, so we know we are looking for std::vector<int>::iterator, but when we do lookups for this type using accelerator tables, we must use the basename iterator and filter out results that don’t match (filter out std::*<T>::iterator where * are all STL classes, but keep any whose decl context matches std::vector<int>). If we can’t quickly remove matching accelerator table entries somehow (we can with Apple accelerator tables), we must parse all of the DWARF (parse all DIEs in any matching CUs/TUs) and we currently make clang AST types out of each “iterator” match from all STL classes, and then throw away things by checking the decl context of the matching types that were produced.
So we made the Apple accelerator tables allow us to remove any unneeded entry matches without having to parse ANY DIEs from any CUs/TUs whose decl context doesn’t match std::vector<int>.


With all of that in mind, I’d like to see if we have consensus on:

  1. Which attribute to implement in the accelerator tables.
  2. That we are not making the existing problems with the spelling of templates any worse by using such an implementation when accelerator tables are used to find a type given its fully qualified name.

(2) seems to make sense to me.

As for which implementation to pick, it sounds like we could implement DW_IDX_parent by following the DWARF spec, whereas if we picked DW_IDX_type_hash we would have to bend the spec to do what we want. The extra overhead of parsing the abbreviation and DW_AT_name of each parent doesn’t sound prohibitively expensive to me.

@dblaikie Does that sound reasonable to you so far?

I’m trying to synthesize a more complete response here, sorry it’s taking some time.

(the really short/inarticulate summary is this:

  • DW_IDX_parent will add index size, which is concerning (as the index is an especially important size concern for us - it’s one of the few bits of debug info that remain in the .o/executable when using split DWARF), but maybe it’s fine - data would be good.
  • Using the absence of DW_IDX_parent to conclude that the type has no decl context (it’s in the global namespace) seems problematic, since it wouldn’t distinguish between when a producer just wasn’t producing parent at all, and when it was but the entity had no parent - perhaps we can use the CU as the parent for global namespaced entities to disambiguate this case
  • I still wonder whether there’s another approach that might help size rather than hurt it: it seems really inefficient to look up every instance of iterator, for instance, when looking for std::vector<int>::iterator - what if each name was looked up, and the one with the fewest results was used and searched within? I guess that doesn’t work for namespaces - for foo::bar you can’t be sure you’ll find bar in any arbitrary foo. But it does work for anything with template parameters, because those are closed… well, almost. It doesn’t work for template members of templates - for foo<int>::bar<int> the bar<int> might only be in some foo<int>… (but it still feels bad to incur all this cost for the iterator case - and other cases where the members are constrained and do appear in every instance of the outer entity) - I had this concern just about the debug_names size in general: if we could avoid having to put an entry for every class member in the debug_names… :confused: )

Tangentially related ramblings: [clang][DebugInfo] Revert "emit definitions for constant-initialized static data-members" by Michael137 · Pull Request #74580 · llvm/llvm-project · GitHub

But, I guess, even if we do the “don’t include invariant members in the index” there’d still be a need to look up the variant members in the index (maybe? unclear how that works for lldb) so maybe building this parent thing would still be needed & might as well be done first/independently of the “let’s not put every member in the index” approach, maybe.

I think we could try the DW_IDX_parent approach to see if it is fast enough.

We’ve recently encountered another related issue. Internally, we have enabled the -gsimple-template-names flag by default. However, for targets with DWARF 5 and .debug_names enabled, the combination with -gsimple-template-names is causing LLDB to crash due to a stack overflow.

Upon investigation, we found that enabling -gsimple-template-names results in two issues:

  1. It causes a much larger fanout during type searching (because there are no template parameters to filter). This may slow down the process but not cause a crash.
  2. More importantly, without template parameters, LLDB matches the declaration context for the wrong types (which differ only by template parameters). It then attempts to resolve the type here. This resolution can encounter another forward declaration record and trigger the same type lookup to .debug_names, resulting in an infinite recursion-style stack overflow crash. In our example, one variable’s template parameter using std::pair<A, B> triggers the infinite recursion.
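To illustrate the second issue, here is a hypothetical example (not our internal reproducer, which involved std::pair<A, B> as a variable’s template parameter): with -gsimple-template-names, distinct instantiations share the same DW_AT_name, so a declaration context comparison that only sees names cannot tell them apart.

    // Hypothetical illustration. With -gsimple-template-names both
    // instantiations below get DW_AT_name "pair", so a decl-context
    // comparison by name alone ("std" -> "pair") cannot distinguish them,
    // and a forward declaration of one can be "resolved" to the definition
    // of the other.
    #include <utility>

    struct A {};
    struct B {};

    std::pair<A, B> ab; // DW_AT_name: "pair"
    std::pair<B, A> ba; // also DW_AT_name: "pair"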

To clarify, I’ve recently realized that there are two entry points that trigger DebugNamesDWARFIndex::GetTypes in LLDB: one from SymbolFileDWARF::FindTypes (most likely from user-input expression evaluation) and the other from SymbolFileDWARF::FindDefinitionTypeForDWARFDeclContext. The latter is triggered when LLDB encounters a forward-declaration DWARF record and attempts to find/resolve the definition DIE. Our internal DWARF 5 .debug_names-enabled targets are mainly struggling with the second code path because it is triggered automatically when showing local variables, whereas expression evaluation (the first code path) seems to be a rare user scenario in our experience, at least from the VSCode IDE perspective.

Currently, we have to work around this issue by disabling -gsimple-template-names for any targets using .debug_names, but we will have to find a solution before using them together.

Btw: @dblaikie, I am curious to know whether Google has hit this issue, considering -gsimple-template-names has been enabled by default there. Maybe .debug_names is not enabled there yet?

Correct. We have -gsimple-template-names on by default internally, but haven’t enabled .debug_names yet. We could do that in theory for our dev builds (they use split DWARF and no type units, so that’s been supported in .debug_names in LLVM for a while), but we’ve been holding off to get things better implemented, like type unit support (which Meta has been contributing), used by some of our builds, and the linker merging of .debug_names that @cmtice is working on.

We’re certainly interested in this combination working well, and I’m more than happy to provide any info I can, mostly from the compiler/dwarf generation side about how things work/how I think they should work, etc.


FYI I am cleaning up a prototype implementation (the DWARF 5 accelerator table data structures were changed upstream while I was doing this…). If my numbers are correct, it reduces the expression evaluation time from 5000ms to 700ms. I’ll provide more accurate data soon.


Sounds beautiful - looking forward to seeing that.

(while you’re measuring things, would be good to know how much the parent attributes increase the size of the indexes - say, in terms of .debug_names increase in a clang build)

Awesome!

Once you get the patch working, I am happy to give it a try against our internal targets.
In our case, it is a local variable whose template parameter has a forward-declaration record in DWARF, forcing a type lookup for the definition. Without -gsimple-template-names, it takes 50~60 seconds to finish. If we get a similar scale of win as your estimate, it would drop to under 10 seconds. Not as fast as we want, but still a great improvement.

Took me a bit longer, but I wanted to be confident the implementation is decently correct, otherwise the data is pointless.

Here is a prototype implementation: [DO NOT MERGE][DebugInfo] Implement debug_names's IDX_parent attribute by felipepiovezan · Pull Request #75365 · llvm/llvm-project · GitHub
(the commits are meant to be looked at one at a time)

I built a stage 1 compiler with the commits above.
Then I built a stage 2 compiler at commit 40e2bb533084 twice: once with the parent attribute and once without (reverting “[AsmPrinter][AccelTable] Make IDX_parent on by default” in the stage 1 compiler)

Regarding size:

debug_names size in all object files:
With parent    = 475,824,752   (28% increase)
Without parent = 370,166,960  
size of all object files:
With parent    = 11,547,074,344  (0.92% increase)
Without parent = 11,441,527,128
Commands used to measure
obj_files_with_parents=$(find build_stage2_Debug_assert_dwarf5 -name \*.cpp.o)
obj_files_without_parents=$(find build_stage2_Debug_assert_dwarf5_no_idx_parent -name \*.cpp.o)

echo "debug_names size in all object files:"
$stage1_build/bin/llvm-objdump --section-headers  $obj_files_with_parents | \
  grep __debug_names | \
  awk "{s+=\"0x\"\$3}  END {printf \"With parent    = %'d\n\", s}"
$stage1_build/bin/llvm-objdump --section-headers  $obj_files_without_parents | \
  grep __debug_names | \
  awk "{s+=\"0x\"\$3}  END {printf \"Without parent = %'d\n\", s}"

echo "size of all object files:"
ls -la $obj_files_with_parents | \
  awk "{s+=\$5}  END {printf \"With parent    = %'d\n\", s}"
ls -la $obj_files_without_parents | \
  awk "{s+=\$5}  END {printf \"Without parent = %'d\n\", s}"

Speed improvements:

I then used the stage 1 LLDB to debug both stage 2 clangs:

  $stage1_build/bin/lldb \
      --batch \
      -o "b CodeGenFunction::GenerateCode" \
      -o run \
      -o "expr Fn" \
      -- \
      $clang_to_use/bin/clang++ -c -g test.cpp -o /dev/null &> output
  grep "Finished expr in:" output

The measurements are consistent when repeated multiple times:

Finished expr in: 1402ms
Finished expr in: 5941ms

Without -gsimple-template-names, it takes 50~60 seconds to finish. If we get a similar scale of win as your estimate, it would drop to under 10 seconds

I think simple-template-names is somewhat orthogonal to this experiment. For that, you might benefit from Greg’s patch, which improves CompilerDeclContext-based queries. The IDX_parent patch deals with your first performance report (the very first post), which is about DWARFDeclContext-based queries. The two are complementary.

30% is a pretty big increase in index size, but if it’s what we need, it’s what we need.

30% is a pretty big increase in index size, but if it’s what we need, it’s what we need.

There is one other thing I can try to reduce this: as a prototype, I was adding the parent attribute to all entries, even those that don’t have a parent (pointing to the end of the table). If it turns out that a substantial number of entries don’t have parents, we can save some space by using a “parentless” abbreviation.
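As a sketch of that idea (the emitter types and calls below are illustrative stand-ins, not the actual AsmPrinter/AccelTable API), entry emission could pick between two abbreviations and only pay for the parent reference when one exists:

    #include <cstdint>
    #include <optional>

    // Illustrative abbreviation codes: one layout with DW_IDX_parent, one without.
    enum : uint16_t { AbbrevWithParent = 1, AbbrevNoParent = 2 };

    struct EntryToEmit {
      uint64_t die_offset;
      std::optional<uint64_t> parent_entry_offset; // empty: entry has no parent
    };

    struct Emitter { // stand-in for the table writer
      void EmitAbbrevCode(uint16_t code);
      void EmitDieOffset(uint64_t offset);
      void EmitParentOffset(uint64_t offset);
    };

    void EmitEntry(Emitter &out, const EntryToEmit &entry) {
      if (entry.parent_entry_offset) {
        out.EmitAbbrevCode(AbbrevWithParent);
        out.EmitDieOffset(entry.die_offset);
        out.EmitParentOffset(*entry.parent_entry_offset); // bytes spent only here
      } else {
        // Parentless entries use the abbreviation without DW_IDX_parent and
        // pay no extra bytes.
        out.EmitAbbrevCode(AbbrevNoParent);
        out.EmitDieOffset(entry.die_offset);
      }
    }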