[RFC] Speeding up DWARF indexing (again)

Hello everyone,

Recently I’ve found myself looking at our “manual” DWARF indexing code, and I was wondering if it could be made faster.

I know this is an area that has been optimized again and again over the years, but all of these endeavours were tackling the problem at a relatively low level (mainly by making ConstString lookups faster). I was wondering if a slightly higher level approach could make things faster. And I think the answer to that question is – with some caveats – yes.

The most expensive part of the DWARF indexing process – despite all past optimizations – is still the construction of ConstStrings, both because of the (repeated) hashing/copying of the strings and because of contention on the string pool locks. This is why my prototype does not use ConstString. Instead, it works with strings from the debug info directly. It also makes use of the fact that debug info strings are usually deduplicated already, which means we can use pointer identity to cut down on the number of string operations (we hash a string once even if it is referred to by N classes/functions/etc.).
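
To illustrate the pointer-identity trick, here is a minimal sketch (hypothetical names, not the actual patch): since identical strings in the debug info usually share a single address, a hash computed for that address can be reused for every DIE that refers to it.

```cpp
// Minimal sketch of hash caching via pointer identity. Illustration only,
// with made-up names; not the prototype code itself.
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"

class StringHashCache {
  // Keyed by the string's address, not its contents.
  llvm::DenseMap<const char *, uint32_t> Cache;

public:
  uint32_t getHash(const char *Str) {
    auto It = Cache.find(Str);
    if (It != Cache.end())
      return It->second; // Same pointer seen before: no rehash, no copy.
    // Truncated to 32 bits, matching the 32-bit hash mentioned below.
    uint32_t Hash =
        static_cast<uint32_t>(llvm::hash_value(llvm::StringRef(Str)));
    Cache.try_emplace(Str, Hash);
    return Hash;
  }
};
```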

The string entries are then placed into a vector, which is sorted according to the hash of the string (but not the string itself). To make better use of multiple CPUs, the list is sharded into 4 pieces, each of which is populated and sorted in parallel.
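
A rough sketch of what this layout could look like (assumed shapes and names, not the real patch); it also shows where the 24-byte entry size mentioned below comes from:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Entry {
  uint32_t Hash;    // 32-bit hash of Name (4 bytes + 4 bytes of padding).
  const char *Name; // Points directly into the debug info (8 bytes).
  uint64_t DieRef;  // Stand-in for the real DIE reference type (8 bytes).
};                  // ~24 bytes total, vs. 16 for a ConstString-keyed entry.

struct ManualIndex {
  static constexpr unsigned NumShards = 4;
  std::vector<Entry> Shards[NumShards];

  void Insert(uint32_t Hash, const char *Name, uint64_t Die) {
    Shards[Hash % NumShards].push_back({Hash, Name, Die});
  }

  // Called once per shard, in parallel, after all insertions are done.
  // Entries are ordered by hash only; ties are broken at lookup time.
  void Finalize(unsigned Shard) {
    std::sort(Shards[Shard].begin(), Shards[Shard].end(),
              [](const Entry &L, const Entry &R) { return L.Hash < R.Hash; });
  }
};
```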

Together, this translates to an approximately 10%-30% speedup (3 to 10 seconds in absolute terms) in DWARF indexing time, depending on the number of CPUs available. For the test I used a very large internal binary (6 GB in the main file + 6 GB DWP). The binary was built with split dwarf (packaged into a DWP) and simple template names. I’m attaching the data I’ve gathered at the end of this message.

Now for the fine print:

  • These improvements come at the cost of slower index lookups. Whereas currently it is sufficient to hash a pointer value (maybe also hash a string if you don’t have a ConstString to begin with), now you have to hash a string as well as perform some number of string comparisons (see the lookup sketch after this list). Unfortunately, these are harder to quantify (if anyone has suggestions, I’m all ears), but I don’t expect this to be a problem for several reasons:
    • Hash collisions are rare. While certainly possible, I haven’t found a single hash collision in the aforementioned large binary – the 32-bit hash always uniquely identified a string. This means that ~every string comparison is going to succeed and be followed by more work – which is probably going to be (a lot) more expensive than the string comparison itself (e.g. DWARF parsing)
    • I think users are not likely to notice a tiny (at most a few ms) slowdown in some operations, but they definitely notice the startup time, during which they’re unable to do anything
    • This makes the manual index similar to how the other indexes work, which also rely on string hashing
  • The memory impact of this has (surprisingly) also proven difficult to measure. On one hand, we have the increase in the size of a single index entry (16->24 bytes), but OTOH we also avoid inserting a lot of strings into the string pool. Depending on how I count the bytes, these balance out one way or the other, but overall the impact doesn’t appear to be big. Counting the bytes by hand, I get an increase of about 200MB (for context, that’s about 10% of the ConstString pool), while tcmalloc reports a decrease of ~2GB. The first value is probably closer to reality, as I think the decrease is caused by the fact that in the “before” case the ConstString pool happened to have a very low utilization ratio (just over 50%) and tcmalloc counts all of this memory as “used”. I think this should be fine, as this example uses a particularly large binary (200MB is 1.6% of the binary+debug info, so for a “small” binary of say 1GB we’re looking at an extra ~16 megabytes). The thing I also like about this approach is that it unlocks the ability to shrink the ConstString pool even further – so far the reductions have been modest because a lot of the strings end up in the pool anyway due to symtab parsing. If we did something similar there as well, I think the reductions would be much larger.
  • The prototype ignores/disables lldb’s on-disk cache. I did that because the format of the cache is tied to the in-memory representation, and I’ve been experimenting with that a lot. Adding it back in isn’t a problem – we just need to figure out whether we want to reuse the same format (convert it on the fly) or bump the format version number.
  • The prototype also ignores Objective-C. The situation here is slightly trickier because some of the strings being indexed (the “full name without category”, at least) are not present in the debug info. However, I also don’t think the indexing performance of ObjC code is particularly concerning for anyone (Apple targets always use compiler-generated indexes), which is why I propose to keep the ConstString implementation for that part for now.
  • The measurements were made in a specific configuration that matters the most for my use case (and because I cannot build this binary without split dwarf) – binary+DWP package. The speed benefit can be different depending on the exact configuration, but I have no reason to believe that it should be negative. I’ve also done a quick test by building the same binary as a bunch of DWO files (which I think is close to a worst case for this algorithm, as there’s much less string deduplication) and the speedup was still around 20%, mainly because the slowest part here is loading all these files, and I’m not doing anything about that. In particular, I expect the basic “simple dwarf in a single file” case to behave very similarly to the DWP scenario.
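
Here is what a lookup could look like under this scheme, continuing the hypothetical Entry/ManualIndex types from the sketch above (again an illustration, not the actual patch): hash the query string, binary-search the hash-sorted shard, and fall back to comparing actual strings only for entries whose hash matches.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed to be the same hash function that was used while indexing.
uint32_t ComputeHash(const char *Str);

template <typename Callback>
void Lookup(const ManualIndex &Idx, const char *Name, Callback Found) {
  uint32_t Hash = ComputeHash(Name);
  const std::vector<Entry> &Shard = Idx.Shards[Hash % ManualIndex::NumShards];
  Entry Probe{Hash, nullptr, 0};
  auto HashLess = [](const Entry &L, const Entry &R) {
    return L.Hash < R.Hash;
  };
  auto Range = std::equal_range(Shard.begin(), Shard.end(), Probe, HashLess);
  for (auto It = Range.first; It != Range.second; ++It) {
    // Collisions are rare, so this loop body almost always runs at most
    // once, and the strcmp almost always succeeds (and is then followed by
    // more expensive work, like DWARF parsing).
    if (std::strcmp(It->Name, Name) == 0)
      Found(It->DieRef);
  }
}
```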

The promised data (I also have these in spreadsheet form, but Discourse will not let me attach that):

All times are in seconds.

| # of CPUs | Before: index | Before: real | Before: user | Before: sys | After: index | After: real | After: user | After: sys | Speedup: index | Speedup: real | Speedup: user | Speedup: sys |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 80.6240284 | 107.8646 | 101.6276 | 5.9284 | 69.708107 | 95.467 | 90.5454 | 4.793 | 13.54% | 11.49% | 10.90% | 19.15% |
| 2 | 43.5178524 | 70.323 | 102.5882 | 5.7318 | 37.4008982 | 63.2192 | 92.9868 | 4.9298 | 14.06% | 10.10% | 9.36% | 13.99% |
| 4 | 23.9603476 | 50.8498 | 104.0514 | 6.2966 | 19.8062866 | 45.5366 | 92.568 | 5.115 | 17.34% | 10.45% | 11.04% | 18.77% |
| 6 | 18.046839 | 44.9888 | 105.315 | 6.7676 | 14.233697 | 39.9608 | 93.1018 | 5.3244 | 21.13% | 11.18% | 11.60% | 21.33% |
| 8 | 15.1733502 | 41.9 | 105.9692 | 7.119 | 11.565655 | 37.1354 | 94.0064 | 5.435 | 23.78% | 11.37% | 11.29% | 23.66% |
| 12 | 12.4075434 | 39.1226 | 107.8122 | 8.2642 | 8.797479 | 34.534 | 96.074 | 5.907 | 29.10% | 11.73% | 10.89% | 28.52% |
| 16 | 10.9954796 | 37.7014 | 109.2016 | 9.1818 | 7.5542892 | 33.111 | 96.7748 | 6.2404 | 31.30% | 12.18% | 11.38% | 32.04% |
| 24 | 9.840055 | 36.5186 | 113.2394 | 11.4048 | 6.571277 | 32.1342 | 99.8086 | 7.3892 | 33.22% | 12.01% | 11.86% | 35.21% |
| 32 | 9.6142774 | 36.436 | 119.1362 | 15.5452 | 6.2365878 | 31.8726 | 102.3902 | 8.1144 | 35.13% | 12.52% | 14.06% | 47.80% |
| 40 | 9.6252936 | 36.363 | 123.0152 | 19.1952 | 6.2850254 | 31.9494 | 104.5866 | 8.949 | 34.70% | 12.14% | 14.98% | 53.38% |
| 48 | 9.6543632 | 36.5012 | 127.093 | 22.6552 | 6.299849 | 32.2302 | 107.6502 | 9.4982 | 34.75% | 11.70% | 15.30% | 58.07% |
| 56 | 9.880139 | 36.6712 | 137.6308 | 26.0662 | 6.3419334 | 32.2838 | 115.4764 | 10.0004 | 35.81% | 11.96% | 16.10% | 61.63% |
| 64 | 9.8325604 | 36.5754 | 144.4644 | 29.279 | 6.37576 | 32.3144 | 120.9148 | 10.1406 | 35.16% | 11.65% | 16.30% | 65.37% |
| 72 | 9.943612 | 36.6 | 149.7128 | 34.1372 | 6.4304142 | 32.1288 | 126.1826 | 10.3782 | 35.33% | 12.22% | 15.72% | 69.60% |
| 80 | 10.1495998 | 36.9084 | 156.2808 | 37.7046 | 6.6184212 | 32.4224 | 130.1554 | 11.362 | 34.79% | 12.15% | 16.72% | 69.87% |
| 88 | 10.1908064 | 36.9686 | 159.1982 | 38.5558 | 6.7097278 | 32.3348 | 133.1608 | 11.5648 | 34.16% | 12.53% | 16.36% | 70.01% |
| 96 | 10.1680998 | 36.9188 | 160.0384 | 42.891 | 6.6944972 | 32.3522 | 136.2802 | 11.7752 | 34.16% | 12.37% | 14.85% | 72.55% |

It’s interesting that the speedup for 2 and 4 CPUs is slightly less than for 1. Cache contention perhaps?
Is the number of CPUs here the number of map shards?

Looks cool!
edit: Never mind, found the parallelism

Interesting question. Looking at the raw data (each value in the table is computed as an average of five runs), I see that the very first run in the BEFORE column has a slightly higher (~3s) run time than the rest – most likely because of a cold start. The AFTER column doesn’t have that, probably because I was running the built binary to check that it works before kicking off the benchmark script.

If I remove the outlier, the first row becomes:

| # of CPUs | Before: index | Before: real | Before: user | Before: sys | After: index | After: real | After: user | After: sys | Speedup: index | Speedup: real | Speedup: user | Speedup: sys |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 80.56784425 | 107.2575 | 101.524 | 5.5935 | 69.708107 | 95.467 | 90.5454 | 4.793 | 13.48% | 10.99% | 10.81% | 14.31% |

… which doesn’t completely remove the discrepancy, but it makes it much smaller. It’s possible the rest could be explained by hyperthreading. I’m getting slightly lower numbers when restricting the process to threads on two different cores than I do for threads on the same core (which is how the benchmark was run). If the new algorithm is more susceptible to that (because it makes better use of the core’s resources), then this could translate to a slightly lower speedup factor in the two-thread scenario (the difference wouldn’t be visible in the other measurements, as they use multiples of two CPUs). I haven’t tried to confirm this hypothesis.
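
For reference, this is the kind of pinning I mean (a sketch; the CPU numbers are machine-specific, and the lldb invocation is just a stand-in for the actual benchmark command):

```
# Sibling hyperthreads of a core are listed in sysfs, e.g.:
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# Two threads on the same physical core (assuming 0 and 1 are siblings):
taskset -c 0,1 lldb --batch -o 'target create ./a.out' -o quit

# Two threads on different physical cores (assuming 0 and 2 are not):
taskset -c 0,2 lldb --batch -o 'target create ./a.out' -o quit
```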

I think the trade-off between faster startup time and slightly slower lookups makes sense.

The statistics dump command shows you the string pool utilization, so that should be fairly easy to confirm:

```
(lldb) statistics dump
...
  "memory": {
    "strings": {
      "bytesTotal": 3145728,
      "bytesUnused": 1432792,
      "bytesUsed": 1712936
    }
  },
...
```
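
(In this sample output, the utilization would be bytesUsed / bytesTotal = 1712936 / 3145728 ≈ 54%.)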

This is true, but we did lose a bit of indexing when we moved to debug_names. I don’t remember the details off the top of my head, but I remember making changes to clang to emit more information into debug_info because it was only encoded in apple_objc. I also don’t remember if that goes through the manual DWARF index, so maybe this is unrelated.

:thumbsup:

I did that. That’s how I know the utilization was around 50%. The thing I’m less sure about is how tcmalloc counts memory usage. The simplest implementation would count all of this as “used”, but its report does contain some data it gets straight from the kernel, so it’s conceivable that it could exclude unmodified pages using that. I think this isn’t the case, but I haven’t tried to prove it.

I’m pretty sure it isn’t, because the index works at a per-unit granularity. As soon as the unit is mentioned in the debug_names index, the manual index won’t touch it.

What may be relevant is that if you’ve found a way to make some of the objc stuff work without the explicit objc accelerator section, then we may be able to reuse that in the manual index as well (and index fewer things?, and then maybe not need ConstString for objc??). It feels like this should be possible, because debug_names only indexes strings that are present in the debug_info, so there should never be a need to “invent” new strings in the manual index either.