[RFC] - Deduplication of debug information in linkers (LLD).

Wasn’t our (lld/ELF’s) position on debug info size that we should focus on providing a great split-dwarf workflow and not try to go too far out of our way to deduplicate or otherwise reduce debug info size inside LLD? I recall there being some patches that made linking of large debug binaries, like a 1.5GB+ clang, faster, but we decided to reject those changes because split-dwarf was the “right” solution.

Rafael, Rui?

(I even recall Rafael saying at one point that a great split-dwarf workflow was one of the key things he considered as necessary for him to consider LLD “done”)

– Sean Silva

The two features/directions don’t really compose - if the DWARF is split, then the linker never sees the DWARF (it’s not in the object files), so it has no deduplication to do. (llvm-dwp might see it, so the deduplication can happen there.)

But couldn’t we, for example, use split DWARF but still deduplicate types?

I don’t mean right now, but in theory?

Or the following workflow:

Split DWARF is used to make the linker process less, such as relocations, right?

What about this (I think I heard it somewhere, so I’m not sure the idea is mine; it was about a year ago):

What about something that combines the outputs into the linker output, so that it could do optimizations of the DWARF data (probably with no need to process relocations at all, so it would probably be fast), and combines everything into a single file?

Some kind of mix. Not sure it makes sense. Just wondering.

> But couldn’t we, for example, use split DWARF but still deduplicate types?

Yep, that is already supported - but the type deduplication happens in the DWP tool/generation, not in the linker. The linker inputs (.o files) don’t contain any of the type information, only the .dwo files contain the type information, etc, in a Fission build.
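To make the point above a bit more concrete, here is a minimal sketch of signature-based type deduplication of the kind a DWP-style merge step can do. It is not llvm-dwp's actual code: `TypeUnit`, `merge_type_units`, and the sample signatures and payloads are hypothetical names and values chosen for illustration. The only thing it relies on is that DWARF type units are identified by a 64-bit type signature, so two units with the same signature can be treated as the same type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeUnit:
    """Hypothetical stand-in for a DWARF type unit found in a .dwo file."""
    signature: int   # 64-bit type signature identifying the type
    payload: bytes   # opaque unit contents (placeholder data here)


def merge_type_units(dwo_files):
    """Keep the first type unit seen for each signature, in input order."""
    seen = {}
    for units in dwo_files:
        for tu in units:
            # setdefault leaves an existing entry alone, so later
            # duplicates of the same signature are dropped.
            seen.setdefault(tu.signature, tu)
    return list(seen.values())


# Two hypothetical .dwo inputs that both describe the same type (0x1111).
a_dwo = [TypeUnit(0x1111, b"struct Foo"), TypeUnit(0x2222, b"struct Bar")]
b_dwo = [TypeUnit(0x1111, b"struct Foo"), TypeUnit(0x3333, b"struct Baz")]

merged = merge_type_units([a_dwo, b_dwo])
```

The linker never appears in this sketch, which matches the Fission setup described above: the inputs are the .dwo contents, not the .o files.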

> Or the following workflow:
>
> Split DWARF is used to make the linker process less, such as relocations, right?

Partly, though the main motivation, as far as I know, was to provide fewer bytes to the linker in the first place. That’s why something like Apple’s scheme (leave the debug info in the object files, but have the linker ignore it - then merge the debug info separately in dsymutil) wasn’t applicable - because that still leaves large object files. For a distributed build system like Google’s, with very large binaries, the presence of the bytes, even if they’re ignored/not processed by the linker, was problematic.
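A toy model of that argument, counting only the bytes that have to be delivered to the link step under each scheme. Everything here is an assumption made for illustration: the `ObjectFile` fields, the scheme names, and the idea that a split-DWARF object keeps only a small skeleton of debug data are simplifications, and no real sizes are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectFile:
    """Hypothetical size breakdown of one compiled translation unit."""
    code_bytes: int      # code and data the linker actually needs
    debug_bytes: int     # full debug info
    skeleton_bytes: int  # small debug remainder left in the .o when split


def linker_input_bytes(objects, scheme):
    """Bytes that must be shipped to the link step under a given scheme."""
    if scheme in ("traditional", "debug-in-object"):
        # In both cases the debug info lives inside the .o files, so the
        # bytes are present even if the linker ignores them.
        return sum(o.code_bytes + o.debug_bytes for o in objects)
    if scheme == "split-dwarf":
        # The bulk of the debug info sits in separate .dwo files that the
        # linker never receives.
        return sum(o.code_bytes + o.skeleton_bytes for o in objects)
    raise ValueError(f"unknown scheme: {scheme}")
```

In this model "debug-in-object" (the dsymutil-style flow) saves linker work but not linker input size, which is the distinction being drawn above.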

> What about this (I think I heard it somewhere, so I’m not sure the idea is mine; it was about a year ago):
>
> What about something that combines the outputs into the linker output, so that it could do optimizations

Not sure I understand what you’re suggesting here, sorry :confused:

> Not sure I understand what you’re suggesting here, sorry :confused:

Ah, it looks like it is this:

“that’s why something like Apple’s scheme (leave the debug info in the object files, but have the linker ignore them -
then merge the debug info separately in dsymutil) wasn’t applicable”.

So instead of splitting the debug info into separate files, it could still be included in the linker output.
The difference from the current traditional flow is that the linker would not process debug relocations
(because technically the info could (probably) be close to, or the same as, what split debug info has), which could (probably)
resolve the issues with relocation processing (too-large values). That would keep a single output file but save the time
that the linker normally spends resolving relocations. It would also allow things like deduplication inside the linker
(which can be convenient), though it increases the output size (not sure whether that is still a problem nowadays).

Have to run away now, sorry.

George.

OK, the above sounds weird. I think I just need to read more about how all the DWARF pieces work
(particularly the details of split debug output and dsymutil, though their logic looks transparent to me at the moment).
Either the brief description I mentioned seeing somewhere was dead
from the start, or, more likely, it suggested something different, probably involving
compiler/debugger-side changes as well. I can only guess now :confused:

George.