Hello everyone,
recently I’ve found myself looking at our “manual” DWARF indexing code, and I was wondering if it could be made faster.
I know this is an area that has been optimized again and again over the years, but all of these endeavours were tackling the problem at a relatively low level (mainly by making ConstString lookups faster), so I wondered whether a slightly higher-level approach could make things faster. I think the answer to that question is, with some caveats, yes.
The most expensive part of the DWARF indexing process, despite all past optimizations, is still the construction of ConstStrings, both because of the (repeated) hashing/copying of the strings and because of contention on the string pool locks. This is why my prototype does not use ConstString at all. Instead, it works with the strings from the debug info directly. It also makes use of the fact that debug info strings are usually deduplicated already, which means we can use pointer identity to cut down on the number of string operations (we hash a string only once, even if it is referred to by N classes/functions/etc.).
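To illustrate the pointer-identity idea, here is a minimal sketch (not the actual prototype code; the class name, the use of a plain std::unordered_map, and the DJB-style hash are just stand-ins): since the names typically point into a deduplicated .debug_str section, we can cache hashes keyed on the string's address, so the bytes of each unique string are hashed at most once, no matter how many DIEs refer to it.

```cpp
#include <cstdint>
#include <string_view>
#include <unordered_map>

// Hypothetical helper: caches string hashes keyed by the string's address.
// Because .debug_str entries are typically deduplicated, N references to the
// same name hit the cache N-1 times and the bytes are hashed only once.
class StringHashCache {
public:
  uint32_t GetHash(const char *str) {
    auto it = m_cache.find(str);
    if (it != m_cache.end())
      return it->second; // Same pointer seen before -> reuse the hash.
    uint32_t hash = HashBytes(str);
    m_cache.emplace(str, hash);
    return hash;
  }

private:
  // Any 32-bit string hash works here; this DJB-style one is just a stand-in.
  static uint32_t HashBytes(std::string_view s) {
    uint32_t h = 5381;
    for (unsigned char c : s)
      h = h * 33 + c;
    return h;
  }

  std::unordered_map<const char *, uint32_t> m_cache;
};
```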
The string entries are then placed into a vector, which is sorted according to the hash of the string (but not the string itself). To make better use of multiple CPUs, the list is sharded into four pieces, each of which is populated and sorted in parallel.
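And a similarly hypothetical sketch of the sharded, hash-sorted index itself. The entry layout, the shard-by-hash scheme and the thread-per-shard sorting are my own guesses at one way to arrange this; the prototype may well shard and parallelize the population differently (here only the sort runs in parallel, for brevity).

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical index entry: 32-bit hash of the name, a pointer to the
// (debug-info-owned) string, and a reference to the DIE it names.
struct IndexEntry {
  uint32_t hash;    // hash of `name`
  const char *name; // points into .debug_str, not copied
  uint64_t die_ref; // stand-in for lldb's DIERef
};

constexpr unsigned kNumShards = 4;

struct ShardedIndex {
  std::vector<IndexEntry> shards[kNumShards];

  // Shard by hash so each shard can be filled and sorted independently.
  static unsigned ShardFor(uint32_t hash) { return hash % kNumShards; }

  void Insert(uint32_t hash, const char *name, uint64_t die_ref) {
    shards[ShardFor(hash)].push_back({hash, name, die_ref});
  }

  // Sort every shard by hash only (the string contents are never compared
  // here); one thread per shard.
  void Finalize() {
    std::vector<std::thread> workers;
    for (auto &shard : shards)
      workers.emplace_back([&shard] {
        std::sort(shard.begin(), shard.end(),
                  [](const IndexEntry &a, const IndexEntry &b) {
                    return a.hash < b.hash;
                  });
      });
    for (auto &t : workers)
      t.join();
  }
};
```

Incidentally, an entry like this comes out at 24 bytes after alignment padding, which is in line with the 16 -> 24 byte entry growth mentioned in the fine print below.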
Together, this translates to an approximately 10-30% (or 3 to 10 seconds in absolute terms) speedup in DWARF indexing time, depending on the number of CPUs available. For the test I used a very large internal binary (6 GB in the main file + 6 GB DWP). The binary was built with split DWARF (packaged into a DWP) and simple template names. I'm attaching the data I've gathered at the end of this message.
Now for the fine print:
- These improvements come at the cost of slower index lookups. Whereas currently it is sufficient to hash a pointer value (plus possibly a string, if you don't have a ConstString to begin with), now you always have to hash a string and also perform some number of string comparisons. Unfortunately, these costs are harder to quantify (if anyone has suggestions, I'm all ears), but I don't expect them to be a problem, for several reasons (a rough sketch of what a lookup involves follows after this list):
  - Hash collisions are rare. While certainly possible, I haven't found a single hash collision in the aforementioned large binary: the 32-bit hash always uniquely identified a string. This means that nearly every string comparison is going to be followed by more work (e.g. DWARF parsing), which is probably going to be (a lot) more expensive than the string comparison itself.
  - I think users are unlikely to notice a tiny (at most a few ms) slowdown in some operations, but they definitely notice the startup time, during which they're unable to do anything.
  - This makes the manual index work similarly to the other indexes, which also rely on string hashing.
- The memory impact of this has (surprisingly) also proven difficult to measure. On one hand, the size of a single index entry increases (16 -> 24 bytes); on the other hand, we also avoid inserting a lot of strings into the string pool. Depending on how I count the bytes, these balance out one way or the other, but overall the impact doesn't appear to be big. Counting the bytes by hand, I get an increase of about 200 MB (for context, that's about 10% of the ConstString pool), while tcmalloc reports a decrease of ~2 GB. The first value is probably closer to reality, as I think the reported decrease is caused by the fact that in the "before" case the ConstString pool happened to have a very low utilization ratio (just over 50%) and tcmalloc counts all of this memory as "used". I think this should be fine, as this example uses a particularly large binary (200 MB is 1.6% of the binary+debug info, so for a "small" binary of, say, 1 GB we're looking at an extra ~16 megabytes). The thing I also like about this approach is that it unlocks the ability to reduce the ConstString pool even further: so far the reductions have been modest because a lot of the strings end up in the pool anyway due to symtab parsing. If we did something similar there as well, I think the reductions would be much larger.
- The prototype ignores/disables lldb's on-disk cache. I did that because the format of the cache is tied to the in-memory representation, and I've been experimenting with that a lot. Adding it back in isn't a problem; we just need to figure out whether we want to reuse the same format (converting it on the fly) or bump the format version number.
- The prototype also ignores Objective-C. The situation here is slightly trickier because some of the strings being indexed (at least the "full name without category") are not present in the debug info. However, I also don't think the indexing performance of ObjC code is particularly concerning for anyone (Apple targets always use compiler-generated indexes), which is why I propose to keep the ConstString implementation for that part for now.
- The measurements were made in the specific configuration that matters the most for my use case (and because I cannot build this binary without split DWARF): a binary plus a DWP package. The speed benefit can be different depending on the exact configuration, but I have no reason to believe it should ever be negative. I've also done a quick test by building the same binary as a bunch of DWO files (which I think is close to the worst case for this algorithm, as there's much less string deduplication), and the speedup was still around 20%, mainly because the slowest part there is loading all those files, and I'm not doing anything about that. In particular, I expect that the basic "simple DWARF in a single file" case should behave very similarly to the DWP scenario.
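Here is the lookup sketch promised above, again with made-up names and the same hypothetical IndexEntry layout (in the sharded scheme, the shard itself would also be chosen from the same hash): the query is hashed once, a binary search narrows the shard down to the entries with a matching hash, and only those candidates are confirmed with a real string comparison. Given the collision numbers above, that candidate range almost always contains zero or one entry.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Same hypothetical entry layout as in the earlier sketch.
struct IndexEntry {
  uint32_t hash;    // hash of `name`
  const char *name; // points into the debug info
  uint64_t die_ref; // stand-in for lldb's DIERef
};

// Heterogeneous comparator so we can binary-search on the hash alone.
struct HashLess {
  bool operator()(const IndexEntry &e, uint32_t h) const { return e.hash < h; }
  bool operator()(uint32_t h, const IndexEntry &e) const { return h < e.hash; }
};

// Look up `query` in one shard (a vector sorted by `hash`). `query_hash` is
// the hash of `query`, computed once by the caller.
std::vector<uint64_t> Lookup(const std::vector<IndexEntry> &shard,
                             const char *query, uint32_t query_hash) {
  std::vector<uint64_t> results;
  auto [lo, hi] =
      std::equal_range(shard.begin(), shard.end(), query_hash, HashLess{});
  for (auto it = lo; it != hi; ++it)
    if (std::strcmp(it->name, query) == 0) // reject hash collisions
      results.push_back(it->die_ref);
  return results;
}
```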
The promised data (I also have these in spreadsheet form, but Discourse will not let me attach that):
# of CPUs | Before: index (s) | Before: real (s) | Before: user (s) | Before: sys (s) | After: index (s) | After: real (s) | After: user (s) | After: sys (s) | Speedup: index | Speedup: real | Speedup: user | Speedup: sys |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 80.6240284 | 107.8646 | 101.6276 | 5.9284 | 69.708107 | 95.467 | 90.5454 | 4.793 | 13.54% | 11.49% | 10.90% | 19.15% |
2 | 43.5178524 | 70.323 | 102.5882 | 5.7318 | 37.4008982 | 63.2192 | 92.9868 | 4.9298 | 14.06% | 10.10% | 9.36% | 13.99% |
4 | 23.9603476 | 50.8498 | 104.0514 | 6.2966 | 19.8062866 | 45.5366 | 92.568 | 5.115 | 17.34% | 10.45% | 11.04% | 18.77% |
6 | 18.046839 | 44.9888 | 105.315 | 6.7676 | 14.233697 | 39.9608 | 93.1018 | 5.3244 | 21.13% | 11.18% | 11.60% | 21.33% |
8 | 15.1733502 | 41.9 | 105.9692 | 7.119 | 11.565655 | 37.1354 | 94.0064 | 5.435 | 23.78% | 11.37% | 11.29% | 23.66% |
12 | 12.4075434 | 39.1226 | 107.8122 | 8.2642 | 8.797479 | 34.534 | 96.074 | 5.907 | 29.10% | 11.73% | 10.89% | 28.52% |
16 | 10.9954796 | 37.7014 | 109.2016 | 9.1818 | 7.5542892 | 33.111 | 96.7748 | 6.2404 | 31.30% | 12.18% | 11.38% | 32.04% |
24 | 9.840055 | 36.5186 | 113.2394 | 11.4048 | 6.571277 | 32.1342 | 99.8086 | 7.3892 | 33.22% | 12.01% | 11.86% | 35.21% |
32 | 9.6142774 | 36.436 | 119.1362 | 15.5452 | 6.2365878 | 31.8726 | 102.3902 | 8.1144 | 35.13% | 12.52% | 14.06% | 47.80% |
40 | 9.6252936 | 36.363 | 123.0152 | 19.1952 | 6.2850254 | 31.9494 | 104.5866 | 8.949 | 34.70% | 12.14% | 14.98% | 53.38% |
48 | 9.6543632 | 36.5012 | 127.093 | 22.6552 | 6.299849 | 32.2302 | 107.6502 | 9.4982 | 34.75% | 11.70% | 15.30% | 58.07% |
56 | 9.880139 | 36.6712 | 137.6308 | 26.0662 | 6.3419334 | 32.2838 | 115.4764 | 10.0004 | 35.81% | 11.96% | 16.10% | 61.63% |
64 | 9.8325604 | 36.5754 | 144.4644 | 29.279 | 6.37576 | 32.3144 | 120.9148 | 10.1406 | 35.16% | 11.65% | 16.30% | 65.37% |
72 | 9.943612 | 36.6 | 149.7128 | 34.1372 | 6.4304142 | 32.1288 | 126.1826 | 10.3782 | 35.33% | 12.22% | 15.72% | 69.60% |
80 | 10.1495998 | 36.9084 | 156.2808 | 37.7046 | 6.6184212 | 32.4224 | 130.1554 | 11.362 | 34.79% | 12.15% | 16.72% | 69.87% |
88 | 10.1908064 | 36.9686 | 159.1982 | 38.5558 | 6.7097278 | 32.3348 | 133.1608 | 11.5648 | 34.16% | 12.53% | 16.36% | 70.01% |
96 | 10.1680998 | 36.9188 | 160.0384 | 42.891 | 6.6944972 | 32.3522 | 136.2802 | 11.7752 | 34.16% | 12.37% | 14.85% | 72.55% |