Fragmented DWARF

Hi all,

At the recent LLVM developers’ meeting, I presented a lightning talk on an approach to reduce the amount of dead debug data left in an executable following operations such as --gc-sections and duplicate COMDAT removal. In that presentation, I presented some figures based on linking a game that had been built by our downstream clang port and fragmented using the described approach. Since recording the presentation, I ran the same experiment on a clang package (this time built with a GCC version). The comparable figures are below:

Link-time speed (s):

Awesome! Sorry I missed the lightning talk, but really interested to see this sort of thing (though it’s not directly/immediately applicable to the use case I work with - Split DWARF, something similar could be used there with further work)

Though it looks like the patch has mostly linker changes - where/how do you generate the fragmented DWARF to begin with? Via the Python script? Run over assembly? I’d be surprised if it was achievable that way - curious to know more.

Do you have a rough sense of, or are you able to run, apples-to-apples comparisons with Alexey’s linker-based patches, to compare linker time/memory overhead against the resulting output size gains?

(& yeah, I’m a bit curious about how the linkers do eh_frame rewriting, if the format is especially amenable to a lightweight parsing/rewriting and how we could make the DWARF more amenable to that too)

The script included in the patch can be used to convert an object containing normal DWARF into an object using fragmented DWARF. It does this by using llvm-dwarfdump to dump the various sections, parsing the output to identify where it should split (using the offsets of the various entries), and then writing new section headers accordingly - you can see roughly what it’s doing if you get a chance to watch the talk recording. The additional section headers are appended to the end of the ELF section header table, whilst the original DWARF is left in the same place it was before (making use of the fact that section headers don’t have to appear in offset order). The script also parses and fragments the relocation sections targeting the DWARF sections so that they match up with the fragmented DWARF sections. This is clearly all suboptimal - in practice the compiler should be modified to do the fragmenting upfront, to save having to parse a tool’s stdout, but that was just the simplest thing I could come up with to quickly write the script. Full details of the script usage are included in the patch description, if you want to play around with it.
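
The splitting step described above could be sketched roughly as follows. This is a minimal illustration and not the script from the patch: the function name `fragment_bounds` and the idea of passing in already-parsed entry offsets are assumptions made for the example (the real script obtains the offsets by parsing llvm-dwarfdump output).

```python
from typing import List, Tuple


def fragment_bounds(section_offset: int, section_size: int,
                    entry_offsets: List[int]) -> List[Tuple[int, int]]:
    """Return (file_offset, size) pairs, one per fragment.

    entry_offsets are section-relative offsets at which a new fragment
    should start (hypothetical input, e.g. taken from a DWARF dump).
    """
    # Always start a fragment at the beginning of the section, and ignore
    # offsets that fall outside the section.
    starts = sorted(set([0] + [o for o in entry_offsets if 0 < o < section_size]))
    bounds = []
    for i, start in enumerate(starts):
        # Each fragment ends where the next one starts (or at section end).
        end = starts[i + 1] if i + 1 < len(starts) else section_size
        bounds.append((section_offset + start, end - start))
    return bounds
```

Each returned pair could then back one new section header pointing into the unchanged DWARF bytes, which is why the original data does not need to move.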

If Alexey could point me at the latest version of his patch, I’d be happy to run that through either or both of the packages I used to see what happens. Equally, I’d be happy if Alexey is able to run my script to fragment and measure the performance of a couple of projects he’s been working with. Based purely on the two packages I’ve tried this with, I can tell already that the results can vary wildly. My expectation is that Alexey’s approach will be slower (at least in its current form, but probably more generally) but produce smaller output; by how much, I have no idea.

I think linkers parse .eh_frame partly because they have no other choice. That being said, I think its format is not too complex, so similarly the parser isn’t too complex. You can see LLD’s ELF implementation in ELF/EhFrame.cpp, how it is used in ELF/InputSection.cpp (see the bits to do with EhInputSection) and EhFrameSection in ELF/SyntheticSections.h (plus various usages of these two throughout the LLD code). I think the key to any structural changes in the DWARF format to make them more amenable to link-time parsing is being able to read a minimal amount without needing to parse the payload (e.g. a length field, some sort of type, and then using the relocations to associate it accordingly).
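
To illustrate the "length field plus some sort of type" idea, here is a minimal sketch of walking .eh_frame records while reading only the length and ID fields. It is not LLD's implementation; the function name is made up for the example, and it assumes little-endian, well-formed input.

```python
import struct
from typing import Iterator, Tuple


def iter_eh_frame_records(data: bytes) -> Iterator[Tuple[int, int, str]]:
    """Yield (offset, total_size, kind) for each record in .eh_frame data.

    Only the length and ID fields are read; the payload is skipped.
    """
    pos = 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        if length == 0:
            # A zero length marks the terminator entry.
            yield pos, 4, "terminator"
            break
        header = 4
        if length == 0xFFFFFFFF:
            # A 64-bit extended length follows the escape value.
            (length,) = struct.unpack_from("<Q", data, pos + 4)
            header = 12
        (cie_id,) = struct.unpack_from("<I", data, pos + header)
        # In .eh_frame, an ID of 0 marks a CIE; anything else is an FDE
        # whose ID field is a pointer back to its CIE.
        kind = "CIE" if cie_id == 0 else "FDE"
        yield pos, header + length, kind
        pos += header + length
```

A linker would additionally use the relocations on each FDE to find the code section it describes, and drop the FDE if that section was discarded.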

James

My patch is at https://reviews.llvm.org/D74169. But I think it needs rebasing. I will rebase/update it in a couple of days.

I will also examine the "Fragmented DWARF" patch this week to see the results (James, thank you for sharing this work!). To compare apples to apples, I guess the D74169 approach should be used with ODR types de-duplication switched off. I will add that option to D74169 (though in that case it would be even slower, the resulting binary size should then be closer to the "Fragmented DWARF" results).

Thank you, Alexey.

James, I updated the patch - https://reviews.llvm.org/D74169.

To make it work, it is necessary to build the example with -ffunction-sections and pass the following options to the linker:

--gc-sections --gc-debuginfo --gc-debuginfo-no-odr

For the clang binary I got the following results:

1. --gc-sections: binary size 1.5G, Debug Info size(*) 1.2G

2. --gc-sections --gc-debuginfo: binary size 840M, 8x performance decrease, Debug Info size 542M

3. --gc-sections --gc-debuginfo --gc-debuginfo-no-odr: binary size 1.3G, 16x performance decrease, Debug Info size 1G

(*) .debug_info+.debug_str+.debug_line+.debug_ranges+.debug_loc

I added the --gc-debuginfo-no-odr option so that the size reduction could be compared correctly. Without that option, D74169 does types deduplication, and it is then not correct to compare the resulting size with the "Fragmented DWARF" solution, which does not do types deduplication.

I am also looking at your D89229 (https://reviews.llvm.org/D89229) and will share results some time later.

Thank you, Alexey.

Great, thanks Alexey! I’ll try to take a look at this in the near future, and will report my results back here. I imagine our clang results will differ, purely because we probably used different toolchains to build the input in the first place.

Hi Alexey,

I’ve just started looking at running your patch on the clang and game packages I used for the Fragmented DWARF experiment, and on both occasions, I got “warning: Generated debug info is broken” near the end of the link. Digging further, the actual error this represented (for the clang case) was “invalid e_shentsize in ELF header: 16912” (aside: there are several Expected instances around where the former warning was reported which are being thrown away and will cause assertions under the right configuration). I don’t really follow the code enough to understand whether this is a bug in the code or possibly some weird interaction with our downstream patches (I don’t expect the latter, for the clang build, as our patches are supposed to be a no-op when not using our target). I’ll check what happens with the clang package if I try using a completely vanilla LLVM with your patch applied.

I also got a large number of “no mapping for range” warnings when linking the game package. I tried debugging the code in the area, but the data types are all difficult to debug, and I don’t really understand the relevant area of code enough to be able to theorise what actually is causing this. llvm-dwarfdump --verify doesn’t flag up any issues, and there’s nothing obviously broken looking at the dump of the debug data either. Any pointers as to what might be going wrong would be appreciated. I assume with your experiments that you build with -ffunction-sections/-fdata-sections for maximum GC opportunities?

Thanks,

James

Hi James,

Thank you very much for the information.
Regarding the first problem: could you please send me the clang build configuration that you used, so that I can reproduce the problem?

For the second problem: yes, I built the experiment with -ffunction-sections -fdata-sections.
According to the error message, it seems that address ranges were read incorrectly.
As a quick guess, could it be that incorrect address ranges are marked with a -1/-2 value? Then they might be handled incorrectly, since this patch does not support (and was not tested with) the LowPC>HighPC case. The simplest solution would be not to use -1/-2 values with this patch.
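
As an illustration of the case described above, a consumer could filter out such ranges before mapping them. This is only a sketch under the assumption that dead ranges are marked with -1 or -2 (as unsigned values of the target address size); the function name is hypothetical and this is not code from either patch.

```python
def is_plausible_range(low_pc: int, high_pc: int, address_size: int = 8) -> bool:
    """Return False for ranges that look dead or malformed."""
    max_addr = (1 << (8 * address_size)) - 1
    # -1 and -2 expressed as unsigned values of the given address size.
    tombstones = {max_addr, max_addr - 1}
    if low_pc in tombstones:
        return False
    # A start address above the end address is also treated as invalid.
    return low_pc <= high_pc
```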

Thank you, Alexey.

Hi James,

I did experiments with the clang code base and will do experiments with our local codebase later.
Overall, both solutions (“Fragmented DWARF” and “DWARFLinker without ODR types deduplication”) appear to give similar size savings for the final binary. “DWARFLinker with ODR types deduplication” gives a bigger size saving. “Fragmented DWARF” increases the size of the original object files by 13% to 18% in these experiments.
LLD with “Fragmented DWARF” is significantly faster than with “DWARFLinker”.

Following are the results for “llvm-strings” and “clang” binaries:

1. llvm-strings:

source object files size: 381M.
fragmented source object files size: 451M (18% increase).

a. upstream version,
   command line options: --gc-sections
   binary size: 6.5M
   compilation time: 0:00.13 sec
   run-time memory: 111kb

b. "fragmented DWARF" version,
   command line options: --gc-sections --mark-live-pc=0.45
   binary size: 3.7M
   compilation time: 0:00.10 sec
   run-time memory: 122kb

c. DWARFLinker version,
   command line options: --gc-sections --gc-debuginfo
   binary size: 3.8M
   compilation time: 0:00.33 sec
   run-time memory: 141kb

d. DWARFLinker no-odr version,
   command line options: --gc-sections --gc-debuginfo --gc-debuginfo-no-odr
   binary size: 4.3M
   compilation time: 0:00.38 sec
   run-time memory: 142kb

2. clang:

source object files size: 6.5G.
fragmented source object files size: 7.3G (13% increase).

a. upstream version,
   command line options: --gc-sections
   binary size: 1.5G
   compilation time: 6 sec
   run-time memory: 9.7G

b. "fragmented DWARF" version,
   command line options: --gc-sections --mark-live-pc=0.43
   binary size: 1.1G
   compilation time: 9 sec
   run-time memory: 11G

c. DWARFLinker version,
   command line options: --gc-sections --gc-debuginfo
   binary size: 836M
   compilation time: 62 sec
   run-time memory: 15G

d. DWARFLinker no-odr version,
   command line options: --gc-sections --gc-debuginfo --gc-debuginfo-no-odr
   binary size: 1.3G
   compilation time: 128 sec
   run-time memory: 17G

Detailed size results:

1. llvm-strings

a)

FILE SIZE VM SIZE
-------------- --------------
41.1% 2.64Mi 0.0% 0 .debug_info
24.9% 1.60Mi 0.0% 0 .debug_str
12.6% 827Ki 0.0% 0 .debug_line
6.5% 428Ki 63.8% 428Ki .text
4.8% 317Ki 0.0% 0 .strtab
3.4% 223Ki 0.0% 0 .debug_ranges
2.0% 133Ki 19.8% 133Ki .eh_frame
1.7% 110Ki 0.0% 0 .symtab
1.2% 77.6Ki 0.0% 0 .debug_abbrev

b)
FILE SIZE VM SIZE
-------------- --------------
50.3% 1.85Mi 0.0% 0 .debug_info
43.6% 1.60Mi 0.0% 0 .debug_str
2.6% 98.2Ki 0.0% 0 .debug_line
2.1% 77.6Ki 0.0% 0 .debug_abbrev
0.5% 17.5Ki 54.9% 17.4Ki .text
0.3% 9.94Ki 0.0% 0 .strtab
0.2% 6.27Ki 0.0% 0 .symtab
0.1% 5.09Ki 15.9% 5.03Ki .eh_frame
0.1% 3.28Ki 0.0% 0 .debug_ranges

c)

FILE SIZE VM SIZE
-------------- --------------
33.0% 1.25Mi 0.0% 0 .debug_info
29.2% 1.11Mi 0.0% 0 .debug_str
11.0% 428Ki 63.8% 428Ki .text
8.2% 317Ki 0.0% 0 .strtab
7.8% 304Ki 0.0% 0 .debug_line
3.4% 133Ki 19.8% 133Ki .eh_frame
2.8% 110Ki 0.0% 0 .symtab
1.7% 65.9Ki 0.0% 0 .debug_ranges
1.0% 38.4Ki 5.7% 38.4Ki .rodata

d)

FILE SIZE VM SIZE
-------------- --------------
39.7% 1.68Mi 0.0% 0 .debug_info
26.3% 1.11Mi 0.0% 0 .debug_str
9.9% 428Ki 63.8% 428Ki .text
7.3% 317Ki 0.0% 0 .strtab
7.0% 304Ki 0.0% 0 .debug_line
3.1% 133Ki 19.8% 133Ki .eh_frame
2.6% 110Ki 0.0% 0 .symtab
1.5% 65.9Ki 0.0% 0 .debug_ranges

2. clang

a)

FILE SIZE VM SIZE
-------------- --------------
58.3% 878Mi 0.0% 0 .debug_info
11.8% 177Mi 0.0% 0 .debug_str
7.7% 115Mi 62.2% 115Mi .text
7.7% 115Mi 0.0% 0 .debug_line
6.0% 90.7Mi 0.0% 0 .strtab
2.4% 35.4Mi 0.0% 0 .debug_ranges
1.5% 23.3Mi 12.5% 23.3Mi .eh_frame
1.5% 23.0Mi 12.4% 23.0Mi .rodata
1.2% 17.9Mi 0.0% 0 .symtab

b)

FILE SIZE VM SIZE
-------------- --------------
71.5% 772Mi 0.0% 0 .debug_info
16.5% 177Mi 0.0% 0 .debug_str
3.7% 40.2Mi 59.2% 40.2Mi .text
2.4% 25.8Mi 0.0% 0 .debug_line
2.1% 23.0Mi 0.0% 0 .strtab
1.0% 10.6Mi 15.6% 10.6Mi .dynstr
0.7% 7.18Mi 10.6% 7.18Mi .eh_frame
0.5% 5.60Mi 0.0% 0 .symtab
0.4% 4.28Mi 0.0% 0 .debug_ranges
0.4% 4.04Mi 0.0% 0 .debug_abbrev

c)

FILE SIZE VM SIZE
-------------- --------------
35.1% 293Mi 0.0% 0 .debug_info
21.2% 177Mi 0.0% 0 .debug_str
13.9% 115Mi 62.2% 115Mi .text
10.9% 90.7Mi 0.0% 0 .strtab
6.9% 57.4Mi 0.0% 0 .debug_line
2.8% 23.3Mi 12.5% 23.3Mi .eh_frame
2.8% 23.0Mi 12.4% 23.0Mi .rodata
2.1% 17.9Mi 0.0% 0 .symtab
1.5% 12.4Mi 0.0% 0 .debug_ranges
1.3% 10.6Mi 5.7% 10.6Mi .dynstr

d)

FILE SIZE VM SIZE
-------------- --------------
58.3% 758Mi 0.0% 0 .debug_info
13.6% 177Mi 0.0% 0 .debug_str
8.9% 115Mi 62.2% 115Mi .text
7.0% 90.7Mi 0.0% 0 .strtab
4.4% 57.4Mi 0.0% 0 .debug_line
1.8% 23.3Mi 12.5% 23.3Mi .eh_frame
1.8% 23.0Mi 12.4% 23.0Mi .rodata
1.4% 17.9Mi 0.0% 0 .symtab
1.0% 12.4Mi 0.0% 0 .debug_ranges
0.8% 10.6Mi 5.7% 10.6Mi .dynstr

Thank you, Alexey.

Hi Alexey,

Thanks for taking a look at these. I noticed you set the --mark-live-pc value to a value other than 1 for the fragmented DWARF version. This will mean additional GC-ing will be done beyond the amount that --gc-sections will do, so unless you use the same value for the option for other versions, the result will not be comparable. (The option is purely there to experiment with the effects were different amounts of the input codebase to be considered dead). Would you be okay to run those figures again without the option specified?

I’m still trying to figure out the problems on my end to try running your experiment on the game package I used in my presentation, but have been interrupted by other unrelated issues. I’ll try to get back to this in the coming days.

James

Oh, I misinterpreted that option. The updated results follow:

1. llvm-strings:

source object files size: 381M.
fragmented source object files size: 451M (18% increase).

a. upstream version,
   command line options: --gc-sections
   binary size: 6.5M
   compilation time: 0:00.13 sec
   run-time memory: 111kb

b. "fragmented DWARF" version,
   command line options: --gc-sections
   binary size: 5.3M
   compilation time: 0:00.11 sec
   run-time memory: 125kb

c. DWARFLinker version,
   command line options: --gc-sections --gc-debuginfo
   binary size: 3.8M
   compilation time: 0:00.33 sec
   run-time memory: 141kb

d. DWARFLinker no-odr version,
   command line options: --gc-sections --gc-debuginfo --gc-debuginfo-no-odr
   binary size: 4.3M
   compilation time: 0:00.38 sec
   run-time memory: 142kb

2. clang:

source object files size: 6.5G.
fragmented source object files size: 7.3G (13% increase).

a. upstream version,
   command line options: --gc-sections
   binary size: 1.5G
   compilation time: 6 sec
   run-time memory: 9.7G

b. "fragmented DWARF" version,
   command line options: --gc-sections
   binary size: 1.4G
   compilation time: 8 sec
   run-time memory: 12G

c. DWARFLinker version,
   command line options: --gc-sections --gc-debuginfo
   binary size: 836M
   compilation time: 62 sec
   run-time memory: 15G

d. DWARFLinker no-odr version,
   command line options: --gc-sections --gc-debuginfo --gc-debuginfo-no-odr
   binary size: 1.3G
   compilation time: 128 sec
   run-time memory: 17G
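
For a quick side-by-side reading of the clang summary figures above, the ratios against the upstream link can be computed as in the sketch below. The sizes are the rounded values from the list, with 1G treated as 1000M for simplicity (an assumption), so the ratios are only approximate.

```python
# Clang figures copied from the summary above: (binary size in MB, link time in s).
results = {
    "upstream": (1500, 6),
    "fragmented DWARF": (1400, 8),
    "DWARFLinker": (836, 62),
    "DWARFLinker no-odr": (1300, 128),
}


def relative_to_baseline(results, baseline="upstream"):
    """Return {name: (size ratio, link-time ratio)} relative to the baseline."""
    base_size, base_time = results[baseline]
    return {
        name: (round(size / base_size, 2), round(time / base_time, 1))
        for name, (size, time) in results.items()
    }


for name, (size_ratio, slowdown) in relative_to_baseline(results).items():
    print(f"{name}: {size_ratio:.2f}x size, {slowdown:.1f}x link time")
```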

Detailed size results:

1. a)

 FILE SIZE        VM SIZE

-------------- --------------
41.1% 2.64Mi 0.0% 0 .debug_info
24.9% 1.60Mi 0.0% 0 .debug_str
12.6% 827Ki 0.0% 0 .debug_line
6.5% 428Ki 63.8% 428Ki .text
4.8% 317Ki 0.0% 0 .strtab
3.4% 223Ki 0.0% 0 .debug_ranges
2.0% 133Ki 19.8% 133Ki .eh_frame
1.7% 110Ki 0.0% 0 .symtab
1.2% 77.6Ki 0.0% 0 .debug_abbrev

b)

 FILE SIZE        VM SIZE

-------------- --------------
40.2% 2.10Mi 0.0% 0 .debug_info
30.7% 1.60Mi 0.0% 0 .debug_str
8.0% 428Ki 63.8% 428Ki .text
5.9% 317Ki 0.0% 0 .strtab
5.9% 313Ki 0.0% 0 .debug_line
2.5% 133Ki 19.8% 133Ki .eh_frame
2.1% 110Ki 0.0% 0 .symtab
1.5% 77.6Ki 0.0% 0 .debug_abbrev
1.3% 69.2Ki 0.0% 0 .debug_ranges

c)

 FILE SIZE        VM SIZE

-------------- --------------
33.0% 1.25Mi 0.0% 0 .debug_info
29.2% 1.11Mi 0.0% 0 .debug_str
11.0% 428Ki 63.8% 428Ki .text
8.2% 317Ki 0.0% 0 .strtab
7.8% 304Ki 0.0% 0 .debug_line
3.4% 133Ki 19.8% 133Ki .eh_frame
2.8% 110Ki 0.0% 0 .symtab
1.7% 65.9Ki 0.0% 0 .debug_ranges
1.0% 38.4Ki 5.7% 38.4Ki .rodata

d)

    FILE SIZE        VM SIZE

-------------- --------------
39.7% 1.68Mi 0.0% 0 .debug_info
26.3% 1.11Mi 0.0% 0 .debug_str
9.9% 428Ki 63.8% 428Ki .text
7.3% 317Ki 0.0% 0 .strtab
7.0% 304Ki 0.0% 0 .debug_line
3.1% 133Ki 19.8% 133Ki .eh_frame
2.6% 110Ki 0.0% 0 .symtab
1.5% 65.9Ki 0.0% 0 .debug_ranges

2. a)

 FILE SIZE        VM SIZE

-------------- --------------
58.3% 878Mi 0.0% 0 .debug_info
11.8% 177Mi 0.0% 0 .debug_str
7.7% 115Mi 62.2% 115Mi .text
7.7% 115Mi 0.0% 0 .debug_line
6.0% 90.7Mi 0.0% 0 .strtab
2.4% 35.4Mi 0.0% 0 .debug_ranges
1.5% 23.3Mi 12.5% 23.3Mi .eh_frame
1.5% 23.0Mi 12.4% 23.0Mi .rodata
1.2% 17.9Mi 0.0% 0 .symtab

b)

 FILE SIZE        VM SIZE

-------------- --------------
59.6% 807Mi 0.0% 0 .debug_info
13.1% 177Mi 0.0% 0 .debug_str
8.5% 115Mi 62.2% 115Mi .text
6.7% 90.7Mi 0.0% 0 .strtab
4.2% 57.4Mi 0.0% 0 .debug_line
1.7% 23.3Mi 12.5% 23.3Mi .eh_frame
1.7% 23.0Mi 12.4% 23.0Mi .rodata
1.3% 17.9Mi 0.0% 0 .symtab
1.0% 13.0Mi 0.0% 0 .debug_ranges
0.8% 10.6Mi 5.7% 10.6Mi .dynstr

c)

 FILE SIZE        VM SIZE

-------------- --------------
35.1% 293Mi 0.0% 0 .debug_info
21.2% 177Mi 0.0% 0 .debug_str
13.9% 115Mi 62.2% 115Mi .text
10.9% 90.7Mi 0.0% 0 .strtab
6.9% 57.4Mi 0.0% 0 .debug_line
2.8% 23.3Mi 12.5% 23.3Mi .eh_frame
2.8% 23.0Mi 12.4% 23.0Mi .rodata
2.1% 17.9Mi 0.0% 0 .symtab
1.5% 12.4Mi 0.0% 0 .debug_ranges
1.3% 10.6Mi 5.7% 10.6Mi .dynstr

d)

 FILE SIZE        VM SIZE

-------------- --------------
58.3% 758Mi 0.0% 0 .debug_info
13.6% 177Mi 0.0% 0 .debug_str
8.9% 115Mi 62.2% 115Mi .text
7.0% 90.7Mi 0.0% 0 .strtab
4.4% 57.4Mi 0.0% 0 .debug_line
1.8% 23.3Mi 12.5% 23.3Mi .eh_frame
1.8% 23.0Mi 12.4% 23.0Mi .rodata
1.4% 17.9Mi 0.0% 0 .symtab
1.0% 12.4Mi 0.0% 0 .debug_ranges
0.8% 10.6Mi 5.7% 10.6Mi .dynstr

Great, thanks! Those results are roughly what I was expecting. I assume “compilation time” is actually just the link time?

I find it particularly interesting that the DWARFLinker rewriting solution produces the same size improvement in .debug_line as the fragmented DWARF approach. That suggests that in that case, fragmented DWARF output is probably about as optimal as it can get. I’m not surprised that the same can’t be said for other sections, but I’m also pleased to see that the full rewrite option isn’t so much better in size improvements.

Regarding the problems I was having with the patch, if you want to try reproducing the problems with clang, I built commit 05d02e5a of clang using gcc 7.5.0 on Ubuntu 18.04, to generate an ELF package. I then used LLD to relink it to create a reproducible package. As I’m primarily a Windows developer, I transferred this package to my Windows machine so that I could use my existing Windows checkout of LLVM, applied your patch, rebuilt LLD, and used that to try linking the package, getting the stated message. I’m going to have another try at the latter now to see if I can figure out what the issue is myself.

James

Hi Alexey,

Just an update - I identified the cause of the “Generated debug info is broken” error message when I tried to build things locally: the outStreamer instance is initialised with the host Triple, instead of whatever the target’s triple is. For example, I build and run LLD on Windows, which means that a Windows triple will be generated, and consequently a COFF-emitting streamer will be created, rather than the ELF-emitting one I’d expect were the triple information to somehow be derived from the linker flavor/input objects etc. Hard-coding in my target triple resolved the issue (although I still got the other warnings mentioned from my game link).
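
As a rough illustration of deriving the target from the input objects rather than from the host, the sketch below reads the architecture from an ELF header. It is not how LLD or the patch does it; the function name and the small machine table are assumptions for the example.

```python
import struct

# A few e_machine values from the ELF specification.
EM_NAMES = {3: "i386", 40: "arm", 62: "x86_64", 183: "aarch64"}


def arch_from_elf_header(header: bytes) -> str:
    """Return an architecture name read from an ELF header's e_machine field."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA]: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 2-byte field at offset 18 in both ELF32 and ELF64.
    (e_machine,) = struct.unpack_from(endian + "H", header, 18)
    return EM_NAMES.get(e_machine, f"unknown({e_machine})")
```

With information like this available from the inputs, the output streamer could be created for the target's object format regardless of the machine the linker runs on.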

I measured the performance figures using LLD patched as described, and using the same methodology as my earlier results, and got the following:

Link-time speed (s):

Great, thanks! Those results are about roughly what I was expecting. I assume "compilation time" is actually just the link time?

yep, that is link time.

Hi James,

Thank you for the details. Actually, I did not test this on Windows, but I will do so and update the patch.

I measured the performance figures using LLD patched as described, and using the same methodology as my earlier results, and got the following:

Link-time speed (s):
+-----------------------------+---------------+
| Package variant             | GC 1 (normal) |
+-----------------------------+---------------+
| Game (DWARF linker)         | 53.6          |
| Game (DWARF linker, no ODR) | 63.6          |
| Clang (DWARF linker)        | 200.6         |
+-----------------------------+---------------+

Output size - Game package (MB):
+-----------------------------+------+
| Category                    | GC 1 |
+-----------------------------+------+
| DWARFLinker (total)         | 696  |
| DWARFLinker (DWARF*)        | 429  |
| DWARFLinker (other)         | 267  |
| DWARFLinker no ODR (total)  | 753  |
| DWARFLinker no ODR (DWARF*) | 485  |
| DWARFLinker no ODR (other)  | 268  |
+-----------------------------+------+

Output size - Clang (MB):
+-----------------------------+------+
| Category                    | GC 1 |
+-----------------------------+------+
| DWARFLinker (total)         | 1294 |
| DWARFLinker (DWARF*)        | 743  |
| DWARFLinker (other)         | 551  |
| DWARFLinker no ODR (total)  | 1294 |
| DWARFLinker no ODR (DWARF*) | 743  |
| DWARFLinker no ODR (other)  | 551  |
+-----------------------------+------+

*DWARF = just .debug_info, .debug_line, .debug_loc, .debug_aranges, .debug_ranges.

Peak Working Set Memory usage (GB):
+-----------------------------+------+
| Package variant             | GC 1 |
+-----------------------------+------+
| Game (DWARFLinker)          | 5.7  |
| Game (DWARFLinker, no ODR)  | 5.8  |
| Clang (DWARFLinker)         | 22.4 |
| Clang (DWARFLinker, no ODR) | 22.5 |
+-----------------------------+------+

My opinion is that, in the current state of affairs, the time costs of the DWARF Linker approach are not really practical for larger packages, except on build servers: clang takes 8.8x as long as the fragmented approach and 11.2x as long as the plain approach (without the no ODR option). The size saving is certainly good: with my version of clang, the DWARF linker output is 51% of the total output size of the plain approach and 55% of that of the fragmented approach (though it is likely that further size savings might be possible for the latter). The game produced reasonable size savings too (62% and 74% respectively), but I’d be surprised if these gains would be enough for people to want to use the approach in day-to-day situations, which presumably is the main use-case for smaller DWARF, due to improved debugger load times.

It is interesting to note that the GCC 7.5 build of clang I used for these figures produced no difference in size results between the two variants, unlike other packages. Consequently, a significant amount of time is saved for no penalty.

I’ll be interested to see what the time results of the DWARF linker are once further improvements to it have been made.

yep, current time costs of the DWARFLinker are too high. One of the reasons is that lld handles sections in parallel, while DWARFLinker handles data sequentially. DWARFLinker numbers could probably be improved if it were taught to handle data in parallel. Thank you for the comparison!

Speaking of “Fragmented DWARF” solution, how do you estimate memory requirements to support fragmented object files? In comments for your Lightning Talk you have mentioned that it would be necessary to “update DebugInfo library to treat the fragmented sections as one continuous section”. Do you think it would be cheap to implement?

Thank you, Alexey.

Hi Alexey,

Hi James,

(Resending with history trimmed to avoid it getting stuck in moderator queue).

Hi Alexey,

Just an update - I identified the cause of the “Generated debug info is broken” error message when I tried to build things locally: the outStreamer instance is initialised with the host Triple, instead of whatever the target’s triple is. For example, I build and run LLD on Windows, which means that a Windows triple will be generated, and consequently a COFF-emitting streamer will be created, rather than the ELF-emitting one I’d expect were the triple information to somehow be derived from the linker flavor/input objects etc. Hard-coding in my target triple resolved the issue (although I still got the other warnings mentioned from my game link).

Thank you for the details. Actually, I did not test this on Windows. But I would do and update the patch.

I measured the performance figures using LLD patched as described, and using the same methodology as my earlier results, and got the following:

Link-time speed (s):
±----------------------------±--------------+

Package variant | GC 1 (normal) |
±----------------------------±--------------+

Game (DWARF linker) | 53.6 |
Game (DWARF linker, no ODR) | 63.6 |

Clang (DWARF linker) | 200.6 |
±----------------------------±--------------+

Output size - Game package (MB):
±----------------------------±-----+

Category | GC 1 |

±----------------------------±-----+

DWARFLinker (total) | 696 |

DWARFLinker (DWARF*) | 429 |

DWARFLinker (other) | 267 |
DWARFLinker no ODR (total) | 753 |

DWARFLinker no ODR (DWARF*) | 485 |

DWARFLinker no ODR (other) | 268 |

±----------------------------±-----+

Output size - Clang (MB):
±----------------------------±-----+

Category | GC 1 |

±----------------------------±-----+

DWARFLinker (total) | 1294 |

DWARFLinker (DWARF*) | 743 |

DWARFLinker (other) | 551 |
DWARFLinker no ODR (total) | 1294 |

DWARFLinker no ODR (DWARF*) | 743 |

DWARFLinker no ODR (other) | 551 |

±----------------------------±-----+

*DWARF = just .debug_info, .debug_line, .debug_loc, .debug_aranges, .debug_ranges.

Peak Working Set Memory usage (GB):

±----------------------------±-----+

Package variant | GC 1 |

±----------------------------±-----+

Game (DWARFLinker) | 5.7 |
Game (DWARFLinker, no ODR) | 5.8 |

Clang (DWARFLinker) | 22.4 |

Clang (DWARFLinker, no ODR) | 22.5 |

±----------------------------±-----+

My opinion is that the time costs of the DWARF Linker approach are not really practical except on build servers, in the current state of affairs for larger packages: clang takes 8.8x as long as the fragmented approach and 11.2x as long as the plain approach (without the no ODR option). The size saving is certainly good, with my version of clang 51% of the total output size for the DWARF linker approach versus the plain approach and 55% of the fragmented approach (though it is likely that further size savings might be possible for the latter). The game produced reasonable size savings too: 62% and 74%, but I’d be surprised if these gains would be enough for people to want to use the approach in day-to-day situations, which presumably is the main use-case for smaller DWARF, due to improved debugger load times.

It is interesting to note that the GCC 7.5 build of clang I used for these figures produced no difference in output size between the two variants (with and without the no ODR option), unlike other packages. Consequently, a significant amount of time is saved for no penalty.

I’ll be interested to see what the time results of the DWARF linker are once further improvements to it have been made.

Yes, the current time costs of the DWARFLinker are too high. One of the reasons is that lld handles sections in parallel, while DWARFLinker handles data sequentially. The DWARFLinker numbers could probably be improved if it were taught to handle data in parallel. Thank you for the comparison!

No problem! It was useful for me to gather the numbers for internal investigations too. Parallelisation would hopefully help, but at this point it’s hard to say by how much. There are likely going to be additional time costs for fragmented DWARF too, once I fix the remaining deficiencies, as they’ll require more relocations.

Speaking of the “Fragmented DWARF” solution, how do you estimate the memory requirements to support fragmented object files?

I’m not sure if you’re referring to the memory usage at link time or the disk space required for the inputs, but I posted both those figures in my original post in this thread. If it’s something else, please let me know. Based on those figures, it’s clear the cost depends on the input code base, but object files were roughly 25% to 75% bigger and memory usage was 50% to 100% higher. Again, both are likely to go up when I get around to fixing the remaining issues.

In the comments on your lightning talk you mentioned that it would be necessary to “update DebugInfo library to treat the fragmented sections as one continuous section”. Do you think it would be cheap to implement?

I think so. I’d hope it would be possible to replace the data buffer underlying the DWARF section parsing to be able to “jump” to the next fragment (section) when it gets to the end of the previous one. I haven’t experimented with this, but I wouldn’t expect it to be costly in terms of code quality or performance, at least in comparison to parsing the DWARF itself.
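As a rough illustration of the "jump to the next fragment" idea, here is a minimal sketch of a buffer that presents several fragments as one logically contiguous range. FragmentedBuffer and its members are hypothetical names for illustration only; this is not an existing LLVM class, and a real change would need to fit the DebugInfo library's data extractor interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: exposes a list of fragments through logical offsets,
// without gluing the fragments into a single copy up front.
class FragmentedBuffer {
public:
  explicit FragmentedBuffer(std::vector<std::string_view> Frags)
      : Fragments(std::move(Frags)) {
    // Record the logical start offset of every fragment.
    for (std::string_view F : Fragments) {
      Starts.push_back(Total);
      Total += F.size();
    }
  }

  uint64_t size() const { return Total; }

  // Copies up to Length bytes starting at logical Offset, moving on to the
  // next fragment whenever the current one is exhausted.
  std::string read(uint64_t Offset, uint64_t Length) const {
    std::string Out;
    if (Offset >= Total)
      return Out;
    Length = std::min(Length, Total - Offset);
    // Binary search for the last fragment whose start is <= Offset.
    size_t I =
        std::upper_bound(Starts.begin(), Starts.end(), Offset) - Starts.begin() - 1;
    uint64_t Local = Offset - Starts[I];
    while (Length > 0) {
      assert(I < Fragments.size() && "ran past the last fragment");
      std::string_view F = Fragments[I];
      uint64_t N = std::min<uint64_t>(Length, F.size() - Local);
      Out.append(F.substr(Local, N));
      Length -= N;
      Local = 0; // Subsequent fragments are read from their beginning.
      ++I;
    }
    return Out;
  }

private:
  std::vector<std::string_view> Fragments;
  std::vector<uint64_t> Starts;
  uint64_t Total = 0;
};
```

In this sketch, finding the fragment for an arbitrary offset costs a binary search over the fragment start offsets, and only the requested bytes are copied.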


sizeof(InputSection) is 208 (sizeof(Elf64_Shdr)=64) so there is indeed
a significant overhead on fragmented segments.
A MergeInputSection can be split into SectionPiece, which is indeed
lightweight, and MarkLive can mark liveness on these pieces. However,
in InputFiles.cpp we change MergeInputSection to regular if it has a
relocation (toRegularSection). Using more lightweight data structures
for .debug_* fragments is still challenging.

Right, the overhead of additional sections is certainly a potential problem with Fragmented DWARF. I suspect this is where the majority of the cost compared to a “plain” link comes from. Indeed, I suspect it will get worse if I continue developing this concept, as I’ll need to switch to debug data fragments being members of groups together with their corresponding function/data piece, which adds yet more overhead. For other tools like llvm-dwarfdump, I doubt this cost is as significant, or at least it isn’t as important, but I haven’t experimented with them to confirm.

I’m currently considering ways to mitigate the section header overhead in the linker, inspired by how eh_frame and mergeable sections work in LLD. One idea I had, which might also help with the overhead of -ffunction-sections/-fdata-sections, was to have a separate section that indicates the split points; the linker would then internally fragment the sections as dictated by this split-point section. I haven’t explored this idea beyond that high-level concept, but it would at least save on I/O to some degree, if not memory cost.
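The split-point idea can be sketched roughly as follows. This is only an illustration under stated assumptions: Piece, splitSection and emitLive are hypothetical names, the split points are assumed to arrive as a sorted list of offsets, and the small per-piece record is loosely modelled on the lightweight piece concept mentioned above rather than on LLD's actual data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical lightweight record for one fragment of a section. It is much
// smaller than a full section header or input section object.
struct Piece {
  uint64_t Offset;
  uint64_t Size;
  bool Live;
};

// Splits a section of SectionSize bytes at the given offsets. SplitPoints is
// assumed to be sorted; out-of-range or repeated offsets are ignored.
std::vector<Piece> splitSection(uint64_t SectionSize,
                                const std::vector<uint64_t> &SplitPoints) {
  std::vector<Piece> Pieces;
  uint64_t Start = 0;
  for (uint64_t P : SplitPoints) {
    if (P <= Start || P >= SectionSize)
      continue;
    Pieces.push_back(Piece{Start, P - Start, false});
    Start = P;
  }
  if (Start < SectionSize)
    Pieces.push_back(Piece{Start, SectionSize - Start, false});
  return Pieces;
}

// Copies only the pieces marked live, in order, into the output.
std::string emitLive(std::string_view Data, const std::vector<Piece> &Pieces) {
  std::string Out;
  for (const Piece &P : Pieces) {
    assert(P.Offset + P.Size <= Data.size() && "piece out of bounds");
    if (P.Live)
      Out.append(Data.substr(P.Offset, P.Size));
  }
  return Out;
}
```

A garbage-collection pass would set Live on the pieces whose corresponding function or data is retained before emitLive runs. How relocations pointing into the pieces would be handled is not covered by this sketch.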


    Speaking of the "Fragmented DWARF" solution, how do you estimate
    the memory requirements to support fragmented object files?

I'm not sure if you're referring to the memory usage at link time or the disk space required for the inputs, but I posted both those figures in my original post in this thread.

I mean the run-time memory usage of the DebugInfoDWARF library.
Currently, when an object file is loaded and a DWARFContext is created,
the DWARFContext references section data from object::ObjectFile:

DWARFContext(std::unique_ptr<const DWARFObject> DObj, ...)

DWARFObjInMemory(const object::ObjectFile &Obj, ...)

class DWARFObjInMemory {
  const DWARFSection &getLocSection() const;
  const DWARFSection &getLoclistsSection() const;
  StringRef getArangesSection() const;
  const DWARFSection &getFrameSection() const;
  const DWARFSection &getEHFrameSection() const;
  const DWARFSection &getLineSection() const;
  StringRef getLineStrSection() const;
};

class DWARFUnit {
  DWARFContext &Context;
  /// Section containing this DWARFUnit.
  const DWARFSection &InfoSection;
};

struct DWARFSection {
  StringRef Data;
};

DWARFSection references data that is loaded by the object file.
A DWARFSection is assumed to be a monolithic piece of data.
There is code that uses this data and assumes random access:

StringRef LineData = OrigDwarf.getDWARFObj().getLineSection().Data;
LineData.slice(*StmtList + 4, PrologueEnd)
...
StringRef FrameData = OrigDwarf.getDWARFObj().getFrameSection().Data;
FrameData.substr(EntryOffset, InitialLength + 4)
...
InputSec = Dwarf.getDWARFObj().getLocSection();
InputSec.Data.substr(Offset, Length);
...
DWARFDataExtractor RangesData(Context.getDWARFObj(), *RangeSection,
                              isLittleEndian, getAddressByteSize());
uint64_t ActualRangeListOffset = RangeSectionBase + RangeListOffset;
RangeList.extract(RangesData, &ActualRangeListOffset);

That is, it is possible to access an arbitrary piece of a DWARFSection.

If object::ObjectFile contained fragmented sections, we would need a
solution for how that could work.

One possibility is to create a glued copy of the fragmented data and pass it to the DWARFObj.
But that would require loading all the original debug info sections twice
(fragmented sections inside the ObjectFile and glued sections inside the DWARFObj).

Another possibility is to rewrite DebugInfoDWARF/DWARFSection to avoid random access to the data (if that is possible).
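For comparison, the first possibility (a glued copy) could look roughly like the sketch below. glueFragments is a hypothetical helper, not part of DebugInfoDWARF; it only illustrates why existing random-access code would keep working and where the second copy of the data comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: concatenates all fragments of one debug section into
// a single owned buffer. The fragments themselves stay loaded in the object
// file, so the section data exists twice in memory while both are alive.
std::string glueFragments(const std::vector<std::string_view> &Fragments) {
  size_t Total = 0;
  for (std::string_view F : Fragments)
    Total += F.size();

  std::string Glued;
  Glued.reserve(Total); // One allocation of the full section size.
  for (std::string_view F : Fragments)
    Glued.append(F);

  assert(Glued.size() == Total && "glued size must match fragment sizes");
  return Glued;
}
```

With the glued buffer, an access such as std::string_view(Glued).substr(Offset, Length) behaves like the existing substr and slice calls shown above, at the cost of the extra copy.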

If it's something else, please let me know. Based on those figures, it's clear the cost depends on the input code base, but object files were roughly 25% to 75% bigger and memory usage was 50% to 100% higher. Again, both are likely to go up when I get around to fixing the remaining issues.

    In the comments on your lightning talk you mentioned that it
    would be necessary to "update DebugInfo library to treat the
    fragmented sections as one continuous section". Do you think it
    would be cheap to implement?

I think so. I'd hope it would be possible to replace the data buffer underlying the DWARF section parsing to be able to "jump" to the next fragment (section) when it gets to the end of the previous one. I haven't experimented with this, but I wouldn't expect it to be costly in terms of code quality or performance, at least in comparison to parsing the DWARF itself.

So it looks like you assume the second case: avoiding random access to the section data.