[Proposal][Debuginfo] dsymutil-like tool for ELF.

Hi,

We propose llvm-dwarfutil - a dsymutil-like tool for ELF.
Any thoughts on this?
Thanks in advance, Alexey.

In principle, this sounds reasonable to me. I don’t know enough about dsymutil’s interface to know whether it makes sense to try to make it multi-format compatible or not. If it doesn’t I’m perfectly happy for a new tool to be added using the DWARFLinker library.

Some more general thoughts:

  1. Assuming the proposal is accepted, this should be introduced piecemeal into LLVM from the beginning as it is developed, rather than having a separate step 4 in the roadmap.
  2. The default tombstone values used for dead debug data should be those produced by LLD, in my opinion. In an ideal world, we’d factor them into some shared constant. Note that at the time of writing, I believe LLD is currently using BFD-style tombstones, not the new -1/-2.
  3. Does the DWARFLinker library already support multi-threading? If not, it might be a lot of work making things thread-safe.
  4. Given that DWARF v6 doesn’t exist yet, I wouldn’t include that as an option name just yet…!

Thanks for looking at this! Please keep me involved in any related reviews etc.

James

In principle, this sounds reasonable to me. I don't know enough about dsymutil's interface to know whether it makes sense to try to make it multi-format compatible or not. If it doesn't I'm perfectly happy for a new tool to be added using the DWARFLinker library.

Some more general thoughts:
1) Assuming the proposal is accepted, this should be introduced piecemeal into LLVM from the beginning as it is developed, rather than having a separate step 4 in the roadmap.
2) The default tombstone values used for dead debug data should be those produced by LLD, in my opinion. In an ideal world, we'd factor them into some shared constant. Note that at the time of writing, I believe LLD is currently using BFD-style tombstones, not the new -1/-2.

agreed.

3) Does the DWARFLinker library already support multi-threading? If not, it might be a lot of work making things thread-safe.

It does, but in a limited way: it can parallelize the analyzing and cloning stages, i.e. the maximum speedup is 2x.

For a greater performance impact, it could probably be parallelized on a per-compilation-unit basis.

Another thing is that dsymutil currently loads all DIEs from a source object file into memory and releases them after the object file is processed. For a non-linked binary this works OK (big binaries are usually compiled from several object files). For a linked binary it means all DIEs are loaded into memory at once, which requires a lot of memory. A solution to this problem could be to split the source data on a per-compilation-unit basis instead of per file.

yes, making dsymutil/dwarfutil work on a compilation-unit basis with multi-threading support is quite a big piece of work. It looks like it would be good for both dsymutil and dwarfutil.

4) Given that DWARF v6 doesn't exist yet, I wouldn't include that as an option name just yet...!

Would "maxpc" be OK? --tombstone=maxpc ?

Thanks for looking at this! Please keep me involved in any related reviews etc.

sure. Thank you for the comments.

Alexey.

Hey Alexey,

I haven’t had time to look at the corresponding patch yet, but I hope to do that soon. Here are my initial thoughts on the proposal.

Hi,

We propose llvm-dwarfutil - a dsymutil-like tool for ELF.
Any thoughts on this?
Thanks in advance, Alexey.

======================================================================

llvm-dwarfutil (Apndx A) is a tool for processing debug info (DWARF)
located in built binary files, to improve debug info quality,
reduce debug info size, and accelerate debug info processing.
Supported object file formats: ELF, Mach-O (Apndx B), COFF (Apndx C),
WASM (Apndx C).

======================================================================

Specifically, the tool would do:

  • Remove obsolete debug info which refers to code deleted by the linker
    doing the garbage collection (gc-sections).

  • Deduplicate debug type definitions to reduce the resulting binary
    size.

  • Build accelerator/index tables.
    = .debug_aranges, .debug_names, .gdb_index, .debug_pubnames,
    .debug_pubtypes.

  • Strip unneeded tables.
    = .debug_aranges, .debug_names, .gdb_index, .debug_pubnames,
    .debug_pubtypes.

  • Compress or decompress debug info as requested.

Possible feature:

  • Join split DWARF .dwo files into a single file containing all debug
    info (convert split DWARF into monolithic DWARF).

======================================================================

User interface:

OVERVIEW: A tool for optimizing debug info located in the built binary.

USAGE: llvm-dwarfutil [options] input output

Nit: I would make the output a separate flag with -o for consistency with other similar tools.

OPTIONS: (Apndx E)

======================================================================

Implementation notes:

  1. Removing obsolete debug info would be done using DWARFLinker llvm
    library.

  2. Data types deduplication would be done using DWARFLinker llvm library.

  3. Accelerator/index tables would be generated using DWARFLinker llvm
    library.

This sounds reasonable to me. I think there is value in having all this in LLVM because LLD wants to use a subset of this functionality. If it weren’t for that I’d probably prefer to have this isolated to just the tool.

  4. The interface of the DWARFLinker library would be changed in such a way
    that it would be possible to switch various stages on/off:

class DWARFLinker {
public:
  void setDoRemoveObsoleteInfo(bool DoRemoveObsoleteInfo = false);

  void setDoAppleNames(bool DoAppleNames = false);
  void setDoAppleNamespaces(bool DoAppleNamespaces = false);
  void setDoAppleTypes(bool DoAppleTypes = false);
  void setDoObjC(bool DoObjC = false);
  void setDoDebugPubNames(bool DoDebugPubNames = false);
  void setDoDebugPubTypes(bool DoDebugPubTypes = false);

  void setDoDebugNames(bool DoDebugNames = false);
  void setDoGDBIndex(bool DoGDBIndex = false);
};
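To illustrate how a command-line driver might use such an interface, here is a minimal self-contained sketch. The DWARFLinker class below is a stand-in carrying only the proposed setters, not the real LLVM API, and the option-to-stage mapping is a guess at what the tool might do:

```cpp
#include <cassert>

// Stand-in for the proposed DWARFLinker interface: every processing
// stage is off by default and is enabled explicitly by the driver.
class DWARFLinker {
public:
  void setDoRemoveObsoleteInfo(bool V = false) { DoRemoveObsoleteInfo = V; }
  void setDoDebugNames(bool V = false) { DoDebugNames = V; }
  void setDoGDBIndex(bool V = false) { DoGDBIndex = V; }

  bool DoRemoveObsoleteInfo = false;
  bool DoDebugNames = false;
  bool DoGDBIndex = false;
};

// Hypothetical mapping from llvm-dwarfutil options to linker stages,
// e.g. "llvm-dwarfutil --garbage-collect --build-debug-names in out":
inline DWARFLinker configureLinker(bool GarbageCollect, bool BuildDebugNames) {
  DWARFLinker Linker;
  Linker.setDoRemoveObsoleteInfo(GarbageCollect);
  Linker.setDoDebugNames(BuildDebugNames);
  return Linker;
}
```

This is essentially an options bag handed to the linker, which is close in spirit to the LinkOption approach Jonas mentions.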

We can discuss this in the patch, but in dsymutil we pass LinkOption to the linker. I think that would work great for enabling certain functionality.

  5. Copying the source file contents, stripping tables, and
    compressing/decompressing tables
    would be done by the ObjCopy llvm library (extracted from llvm-objcopy):

Error executeObjcopyOnBinary(const CopyConfig &Config,
                             object::COFFObjectFile &In, Buffer &Out);
Error executeObjcopyOnBinary(const CopyConfig &Config,
                             object::ELFObjectFileBase &In, Buffer &Out);
Error executeObjcopyOnBinary(const CopyConfig &Config,
                             object::MachOObjectFile &In, Buffer &Out);
Error executeObjcopyOnBinary(const CopyConfig &Config,
                             object::WasmObjectFile &In, Buffer &Out);

Just to make sure I understand this correctly. The current method names suggest that you’d be running objcopy as an external tool, but when implemented as a library you’d call the code in-process, right?

  6. Address ranges and single addresses pointing to removed code should be
    marked with a tombstone value in the input file:

-2 for .debug_ranges and .debug_loc.
-1 for other .debug* tables.
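For illustration, the tombstone convention above can be sketched as a small helper (the function names are made up for this example). The reason .debug_ranges and .debug_loc get -2 is that an all-ones address in those sections already means "base address selection entry", so -1 cannot serve as a tombstone there:

```cpp
#include <cstdint>
#include <string>

// Tombstone value for a dead (garbage-collected) address in a given
// debug section, following the DWARF v6-proposed convention above.
uint64_t tombstoneFor(const std::string &SectionName) {
  // In .debug_ranges/.debug_loc an all-ones address is a base address
  // selection entry, so -2 is used instead of -1.
  if (SectionName == ".debug_ranges" || SectionName == ".debug_loc")
    return UINT64_MAX - 1; // i.e. -2
  return UINT64_MAX;       // i.e. -1
}

bool isTombstonedAddress(uint64_t Addr, const std::string &SectionName) {
  return Addr == tombstoneFor(SectionName);
}
```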

  7. Prototype implementation - https://reviews.llvm.org/D86539.

======================================================================

Roadmap:

  1. Refactor llvm-objcopy to extract its implementation into a separate
    library, ObjCopy (in the LLVM tree).

What exactly needs to be copied? In dsymutil we create a Mach-O companion file, which is really just a regular Mach-O with only the debug info sections in it. I think we do copy over a few segments, but we have to rewrite the load commands and obviously the DWARF sections. Which part of that would be handled by the objcopy library? It seems like this could be a first, standalone patch. Or do you only plan to use this for the ELF parts?

  2. Create a command-line utility using the existing DWARFLinker and ObjCopy
    implementations. The first version is supposed to work with only ELF
    input object files.
    It would take an input ELF file with unoptimized debug info and create
    an output ELF file with optimized debug info. That version would be done
    out of the llvm tree.

I would prefer doing this incrementally in-tree. It will make reviewing these patches much easier and hopefully allow us to identify opportunities where we can improve both the ELF and the Mach-O variant.

  3. Make the tool able to work in multi-threaded mode.

I’m a bit confused by what you mean here. The current DwarfLinker already does the analysis and cloning in parallel. As I’ve mentioned in the original thread, when I implemented this, there was no way to do better if you want to deduplicate across compilation units which is what gives the biggest size reduction.

  4. Consider including it in the LLVM tree.

As I said before I’d rather see this developed incrementally in-tree.

  5. Support DWARF5 tables.

I assume you mean the line tables (and not the accelerator tables, i.e. debug names)?

======================================================================

Appendix A. Should this tool be implemented as a new tool or as an extension
to dsymutil/llvm-objcopy?

There already exists a tool which removes obsolete debug info on
Darwin - dsymutil.
Why create another tool instead of extending the already existing
dsymutil/llvm-objcopy?

The main functionality of dsymutil is located in a separate library

  • DWARFLinker.
    Thus, the dsymutil utility is a command-line interface for DWARFLinker.
    dsymutil has a different type of input/output data: it takes several
    object files and an address map as input and creates a .dSYM bundle
    with linked debug info as output. llvm-dwarfutil would take a built
    executable as input and create an optimized executable as output.
    Additionally, there would be many command-line options specific to
    only one utility. This means that these utilities (implementing the
    command-line interface) would differ significantly. It makes sense
    not to put another command-line utility inside the existing dsymutil,
    but to make it a separate utility. That is the reason why
    llvm-dwarfutil is suggested to be implemented not as a sub-part of
    dsymutil but as a separate tool.

Please share your preference: should llvm-dwarfutil be a separate
utility, or a variant of dsymutil compiled for ELF?

As the majority of the code has already been hoisted to LLVM for use in LLD, I think two separate tools are fine. I would prefer trying to share a common interface, I’m thinking mostly of the command line options. I’m not saying they should be a drop-in replacement for each other, but I’d be nice if we didn’t diverge on common functionality.

======================================================================

Appendix B. The Mach-O object file format is already supported by dsymutil.
Depending on the decision whether llvm-dwarfutil is done as a subproject
of dsymutil or as a separate utility, Mach-O would be supported or not.

I don’t think there’s any value in having the new tool support Mach-O. Things that could be shared should be hoisted into L

======================================================================

Appendix C. Support for the COFF and WASM object file formats is presented
as a possible future improvement. It would be quite easy to add them,
assuming that llvm-objcopy already supports these formats. It would also
require supporting the DWARF6-suggested tombstone values (-1/-2).

======================================================================

Appendix D. Documentation.

======================================================================

Appendix E. Possible command line options:

DwarfUtil Options:

--build-aranges - generate .debug_aranges table.
--build-debug-names - generate .debug_names table.
--build-debug-pubnames - generate .debug_pubnames table.
--build-debug-pubtypes - generate .debug_pubtypes table.
--build-gdb-index - generate .gdb_index table.
--compress - Compress debug tables.
--decompress - Decompress debug tables.
--deduplicate-types - Do ODR deduplication for debug types.
--garbage-collect - Do garbage collection for debug info.

This is of course up to you to decide, but as a potential user I might be worried about making all the functionality opt-in. For dsymutil you don't have to pass any options most of the time. Maybe it would be nice to have a set of defaults and the ability to -fenable or -fdisable them? Or have something like -debugger-tuning in clang?

--num-threads=<n> - Specify the maximum number (n) of
simultaneous threads
to use when optimizing the input file.
Defaults to the number of cores on the
current machine.

We can make -j the default alias for this option. It's supported by dsymutil, where we kept the long option in the help output, but I'm happy to change that.

--strip-all - Strip all debug tables.
--strip=<name1,name2> - Strip specified debug info tables.
--strip-unoptimized-debug - Strip all unoptimized debug tables.
--tombstone= - Tombstone value used as a marker of
invalid address.
=bfd - BFD default value
=dwarf6 - DWARF v6.
--verbose - Enable verbose logging and encoding details.

Generic Options:

--help - Display available options (--help-hidden
for more)
--version - Display the version of this program

dsymutil also has a --verify option which runs the DWARF verifier on the output (I’m working on a patch to also run it on the input). It might be a nice addition to have this too down the road.


  3. Does the DWARFLinker library already support multi-threading? If not, it might be a lot of work making things thread-safe.

It does, but in a limited way: it can parallelize the analyzing and cloning stages, i.e. the maximum speedup is 2x.

For a greater performance impact, it could probably be parallelized on a per-compilation-unit basis.

I want to elaborate on this a bit as it’s been coming up several times now.
With the current design you cannot process CUs in parallel. There are two
reasons for that:

  1. When uniquing types, the first time a new type is encountered, it is marked
    as canonical. Every subsequent encounter of that type is replaced by a
    reference to the canonical type which is going to be coming from another CU.
    So to process CUs in parallel, you have to guarantee the reproducibility of
    what that canonical DIE is going to be, or postpone this to a sequential
    step, which would potentially defeat the purpose of the parallel analysis.

  2. During emission, we emit the offset of the canonical type in the output
    directly which allows us to stream out the DWARF. That means we need to know
    the offset when processing the places where the uniquing is removing the
    full type, which implies there’s a sequencing between cloning the CU with
    the canonical type and processing further uses. Without this property, we’d
    also need to keep all the output DIEs in memory until the offset can be
    computed, or we need to do another iteration to patch up the offsets.

I’m not saying this to discourage you, quite the opposite actually. I’d love to
be able to speed-up dsymutil. I’m just sharing this based on my experience when
adding the current concurrency which starts analyzing the next CU when we
finished processing the current one and are emitting it.

I’m sure we could design the uniquing algorithm in a way that would be able to
process CUs in parallel and gather the types, then synchronize to make a
uniquing decision, then clone all CUs in parallel and finally relocate all the
offsets to the canonical DIE. In the current algorithm it’s just not that
simple, because you don’t know whether a type is going to be kept in a
particular CU before having processed it completely.
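One way to make the uniquing decision reproducible regardless of thread scheduling, per the "gather types in parallel, then synchronize" idea above, is to funnel all candidates into a shared registry and pick the canonical DIE by a deterministic rule (e.g. lowest CU index). The classes and the rule below are illustrative only, not dsymutil's actual implementation:

```cpp
#include <map>
#include <mutex>
#include <string>

// Shared registry filled by per-CU analysis threads. After all threads
// join, canonicalCU() gives the same answer on every run: the type
// definition from the lowest-numbered CU wins, independent of which
// thread recorded it first.
class TypeRegistry {
public:
  void record(const std::string &TypeSignature, unsigned CUIndex) {
    std::lock_guard<std::mutex> Guard(Lock);
    auto [It, Inserted] = Canonical.emplace(TypeSignature, CUIndex);
    if (!Inserted && CUIndex < It->second)
      It->second = CUIndex; // deterministic: lowest CU index wins
  }

  unsigned canonicalCU(const std::string &TypeSignature) const {
    std::lock_guard<std::mutex> Guard(Lock);
    return Canonical.at(TypeSignature);
  }

private:
  mutable std::mutex Lock;
  std::map<std::string, unsigned> Canonical;
};
```

A real implementation would still need the synchronization point Jonas describes before cloning, so that canonical-type offsets are known when cross-CU references are emitted.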


  4. Given that DWARF v6 doesn’t exist yet, I wouldn’t include that as an option name just yet…!

Would “maxpc” be OK? --tombstone=maxpc ?

“maxpc” sounds reasonable for an initial stab at a name. I’m sure there’s something better out there, but I can’t think of it, so no need to worry, if you don’t come up with anything better!

right. That is understood. I have not prepared a detailed plan for this yet. Generally, I think, for point 1 we need a multi-threaded version of DeclContext. Then, at the first stage, all CUs would be analyzed (in parallel) and canonical DIEs determined. At the second stage, all CUs would be emitted (in parallel) into their own containers. Finally, all CU containers would be sequentially glued into the resulting output, and offsets referring to de-duplicated types would be changed to the offset of the canonical DIE (in a similar manner to what is currently done in CompileUnit::fixupForwardReferences()). It sounds pretty similar to your above description.

It would be good if such “per CU” processing allowed not only speeding up execution time but also minimizing memory requirements, so that it would not be necessary to load all DIEs at a time. But it is not clear how to do that yet, because of DW_FORM_ref_addr attributes: until we have analyzed all CUs (to understand whether we need to keep the referenced DIEs), we cannot start the cloning stage.

Hi Jonas, please find my comments below…

Ok, let's discuss this in the patch. objcopy could replace debug info sections. So the idea is to use objcopy functionality to copy the original file without modifications, except for replacing the debug info sections, i.e.:

Specify new sections in the objcopy config (CopyConfig.h; the template argument was lost in the original mail and is assumed here to be StringRef, matching how the values are used below):

  StringMap<StringRef> NewDebugSections;

Add code to copy these sections:

  for (const auto &Sec : Config.NewDebugSections) {
    ArrayRef<uint8_t> DataBits((const uint8_t *)Sec.getValue().data(),
                               Sec.getValue().size());
    Section NewSection(DataBits);
    if (Config.CompressionType != DebugCompressionType::None)
      Obj.addSection(NewSection, Config.CompressionType);
    else
      Obj.addSection(NewSection);
  }

Finally, it would be possible to call executeObjcopyOnBinary() and the source file would be copied with the debug info sections replaced:

  objcopy::elf::executeObjcopyOnBinary(Config, InputFile, FB);

Speaking of what should be moved from llvm-objcopy into the ObjCopy library: it is Buffer.h, CopyConfig.h, and the entire ELF, MachO, WASM, and COFF directories. It is done in the prototype (the prototype copied only the ELF part). The external interface of that library would be described by ELF/ELFObjcopy.h, COFF/COFFObjcopy.h, MachO/MachOObjcopy.h, and wasm/WasmObjcopy.h.

agreed.

If we’re designing a new tool and process, it would be wonderful if it did not require multiple stages of copying and slightly modifying the binary, in order to create final output with separate debug info. It seems to me that the variants of this sort of thing which exist today are somewhat suboptimal.

With Mach-O and dsymutil:

  1. Given a collection of object files (which contain debuginfo), link a binary with ld. The binary then includes special references to the object files that were actually used as part of the link.

  2. Given the linked binary, and all of the same object files, link the debuginfo with dsymutil.

  3. Strip the references to the object file paths from the binary.
    Finally, you have a binary without debug info, and a dsym debuginfo file. But it would be better if the binary created in step 1 didn’t need to include the extraneous object-file path info, and that was instead emitted in a second file. Then we wouldn’t need step 3.

With “normal” ELF:

  1. Given a collection of object files (which contain debuginfo), link a binary with ld, which includes linking all the debug info into the binary.

  2. Given the linked binary, objcopy --only-keep-debug to create a new separated debug file.

  3. Given the linked binary, objcopy --strip-debug to create a copy of the binary without debug info.
    Finally you have a binary without debug info, and a separate debug file. But it would be better if the linker could just write the debug info into a separate file in the first place, then we’d only have the one step. (But, downside, the linker needs to manage all the debug info, which can be excessively large.)

With “split-dwarf” ELF support:

  1. Given object files (which exclude most but not all of the debuginfo), link a binary. The binary will include that smaller set of debug info.

  2. Given the collection of dwo files corresponding to the object files, run the “dwp” tool to create a dwp file.

  3. objcopy --only-keep-debug

  4. --strip-debug
    And then you need to keep both a debug file and a dwp file, which is weird.

I think, ideally, users would have the following three good options:

  Easy option: store debuginfo in the object files, and have the linker create a pair of {binary, separated dwarf-optimized debuginfo} files directly from the object files.

  More scalable option: emit (most of the) debuginfo in separate *.dwo files using -gsplit-dwarf, and then,

  1. run the linker on the object files to create a pair of {binary, separated debuginfo} files. In this case the latter file contains the minimal debuginfo which was in the object files.
  2. run a second tool, which reads the minimal debuginfo from above, and all the DWO files, and creates a full optimized/deduplicated debuginfo output file.

  Faster developer builds: Like the previous, but omit step 2 -- running the debugger directly after step 1 can use the dwo files on-disk.

I think we’re not terribly far from that ideal, now, for ELF. Maybe only these three things need to be done? –

  1. Teach lld how to emit a separated debuginfo output file directly, without requiring an objcopy step.

  2. Integrate DWARFLinker into lld.

  3. Create a new tool which takes the separated debuginfo and DWO/DWP files and uses DWARFLinker library to create a new (dwarf-linked) separated-debug file, that doesn’t depend on DWO/DWP files.

My hope is that the tool you’re creating will be the implementation of #3, but I’m afraid the intent is for this tool to be an additional stage that non-split-dwarf users would need to run post-link, instead of integrating DWARFLinker into lld.


Hi James,

Thank you for the comments.

I think we're not terribly far from that ideal, now, for ELF. Maybe
only these three things need to be done? --

 1. Teach lld how to emit a separated debuginfo output file directly,
without requiring an objcopy step.
 2. Integrate DWARFLinker into lld.
 3. Create a new tool which takes the separated debuginfo and DWO/DWP
files and uses DWARFLinker library to create a new (dwarf-linked)
separated-debug file, that doesn't depend on DWO/DWP files.

The three goals you've described are our long-term goals.
Indeed, the best solution would be to create valid optimized debug info without additional
stages and additional modifications of the resulting binaries.

There was an attempt to use DWARFLinker from lld - D74169 ([WIP][LLD][ELF][DebugInfo] Remove obsolete debug info).
It did not receive enough support to be integrated yet. There are fair reasons for that:

1. Execution time. The time required by DWARFLinker to process the clang binary is 8x
the usual linking time. Linking the clang binary with DWARFLinker takes 72 sec;
linking with lld alone takes 9 sec.

2. "Removing obsolete debug info" cannot be switched off. Thus, lld could not use DWARFLinker for
other tasks (like generation of index tables - .gdb_index, .debug_names) without significant
performance degradation.

3. DWARFLinker does not support split dwarf at the moment.

All these reasons are not blockers, and I believe the implementation from D74169 might be
integrated and incrementally improved if there were agreement on that.

Using DWARFLinker from llvm-dwarfutil is another way to use and improve it.
When finally implemented, llvm-dwarfutil should solve the above three issues, and there
would probably be more reasons to include DWARFLinker in lld.

Even if we had the best solution, it would still be useful to have a tool like llvm-dwarfutil
for cases when it is necessary to process already-created binaries.

So in short, the suggested tool - llvm-dwarfutil - is a step towards the ideal solution.
Its benefit is that it can be used until we create the best solution, or for cases
where "the best solution" is not applicable.

Thank you, Alexey.


I think we're not terribly far from that ideal, now, for ELF. Maybe only these three things need to be done? --
 1. Teach lld how to emit a separated debuginfo output file directly, without requiring an objcopy step.

This is very similar to Solaris's ancillary objects (ET_SUNW_ANCILLARY).
There are more details in "Ancillary Objects: Separate Debug ELF Files For Solaris".
In short, Solaris's `ld -z ancillary[=outfile]` writes non-SHF_ALLOC sections to the
ancillary object. Perhaps we will need some coordination with GNU. Some
GNU folks are interested in a new object file type (see the discussion
thread on Google Groups).

A debug file created by {,llvm-}objcopy --only-keep-debug has different
contents (see D67137 "[llvm-objcopy] Implement --only-keep-debug for ELF" for details):
it keeps the non-SHF_ALLOC sections and SHT_NOTE sections. "Ancillary Objects: Separate Debug ELF Files For Solaris"
does not say whether program headers are retained in the debug file, but
{,llvm-}objcopy --only-keep-debug keeps one copy (neither gdb nor lldb needs
the program headers).
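To make the --only-keep-debug behavior discussed above concrete, here is a toy Python model (the field names are made up for illustration; this is nothing like libObject's real API): allocated sections keep their headers but lose their contents, becoming SHT_NOBITS placeholders, while non-SHF_ALLOC sections such as .debug_* survive intact in the debug file.

```python
# Toy model of what {,llvm-}objcopy --only-keep-debug produces:
# SHF_ALLOC sections become headers-only SHT_NOBITS entries (addresses
# still resolve, bytes are gone); everything else is kept as-is.
SHF_ALLOC = 0x2
SHT_PROGBITS = 1
SHT_NOBITS = 8

def only_keep_debug(sections):
    out = []
    for sec in sections:
        if sec["flags"] & SHF_ALLOC:
            # allocated section: drop the bytes, keep the metadata
            out.append({**sec, "type": SHT_NOBITS, "data": b""})
        else:
            # e.g. .debug_* and .symtab: keep intact
            out.append(dict(sec))
    return out

elf = [
    {"name": ".text", "flags": SHF_ALLOC, "type": SHT_PROGBITS, "data": b"\x90"},
    {"name": ".debug_info", "flags": 0, "type": SHT_PROGBITS, "data": b"DIE"},
]
debug_file = only_keep_debug(elf)
# .text is now NOBITS with no contents; .debug_info is untouched.
```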

If we’re designing a new tool and process, it would be wonderful if it did not require multiple stages of copying and slightly modifying the binary, in order to create final output with separate debug info. It seems to me that the variants of this sort of thing which exist today are somewhat suboptimal.

With Mach-O and dsymutil:

  1. Given a collection of object files (which contain debuginfo), link a binary with ld. The binary then includes special references to the object files that were actually used as part of the link.

  2. Given the linked binary, and all of the same object files, link the debuginfo with dsymutil.

  3. Strip the references to the object file paths from the binary.
    Finally, you have a binary without debug info, and a dsym debuginfo file. But it would be better if the binary created in step 1 didn’t need to include the extraneous object-file path info, and that was instead emitted in a second file. Then we wouldn’t need step 3.

With “normal” ELF:

  1. Given a collection of object files (which contain debuginfo), link a binary with ld, which includes linking all the debug info into the binary.

  2. Given the linked binary, objcopy --only-keep-debug to create a new separated debug file.

  3. Given the linked binary, objcopy --strip-debug to create a copy of the binary without debug info.
    Finally you have a binary without debug info, and a separate debug file. But it would be better if the linker could just write the debug info into a separate file in the first place, then we’d only have the one step. (But, downside, the linker needs to manage all the debug info, which can be excessively large.)

With “split-dwarf” ELF support:

  1. Given object files (which exclude most but not all of the debuginfo), link a binary. The binary will include that smaller set of debug info.

  2. Given the collection of dwo files corresponding to the object files, run the “dwp” tool to create a dwp file.

  3. objcopy --only-keep-debug

  4. objcopy --strip-debug
    And then you need to keep both a debug file and a dwp file, which is weird.
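As a rough sketch of what the dwp step in point 2 does (a toy Python model with made-up field names; the real dwp tool additionally emits a .debug_cu_index section recording per-section offsets and sizes), it concatenates each compilation unit's .dwo contribution into one package and records where each unit landed, keyed by its DWO id:

```python
# Toy model of "dwp": pack each CU's .dwo contribution into one blob,
# indexed by DWO id so a debugger can find the right unit (simplified).
def build_dwp(dwo_files):
    index, payload = {}, []
    for dwo in dwo_files:
        offset = sum(len(p) for p in payload)
        index[dwo["dwo_id"]] = offset      # where this CU's data starts
        payload.append(dwo["debug_info_dwo"])
    return index, b"".join(payload)

dwos = [
    {"dwo_id": 0x1111, "debug_info_dwo": b"CU-A"},
    {"dwo_id": 0x2222, "debug_info_dwo": b"CU-B"},
]
index, blob = build_dwp(dwos)
# index == {0x1111: 0, 0x2222: 4}; blob == b"CU-ACU-B"
```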

I think, ideally, users would have the following three good options:
Easy option: store debuginfo in the object files, and have the linker create a pair of {binary, separated dwarf-optimized debuginfo} files directly from the object files.

(as discussed in other replies - that was an early proposal; it didn’t gain a lot of traction, and Eric & Ray weren’t super convinced it was worth adding to lld at this stage, given the link-time cost & thus the small expected user base)

More scalable option: emit (most of the) debuginfo in separate *.dwo files using -gsplit-dwarf, and then,

  1. run the linker on the object files to create a pair of {binary, separated debuginfo} files. In this case the latter file contains the minimal debuginfo which was in the object files.

Yeah, that ^ is probably a nice feature regardless. Save folks an extra objcopy, etc. Usable right now for any build that is already running only-keep-debug/strip-debug.

  2. run a second tool, which reads the minimal debuginfo from above, and all the DWO files, and creates a full optimized/deduplicated debuginfo output file.

Fair - this then looks a lot like the MachO debug info distribution/linking model (with the advantage that the DWARF isn’t in the .o files, so doesn’t have to be shipped to the machine doing the linking), so far as I know.

Faster developer builds: Like previous, but omit step 2 -- running the debugger directly after step 1 can use the dwo files on-disk.

I think we’re not terribly far from that ideal, now, for ELF. Maybe only these three things need to be done? --

  1. Teach lld how to emit a separated debuginfo output file directly, without requiring an objcopy step.

  2. Integrate DWARFLinker into lld.

  3. Create a new tool which takes the separated debuginfo and DWO/DWP files and uses DWARFLinker library to create a new (dwarf-linked) separated-debug file, that doesn’t depend on DWO/DWP files.

My hope is that the tool you’re creating will be the implementation of #3, but I’m afraid the intent is for this tool to be an additional stage that non-split-dwarf users would need to run post-link, instead of integrating DWARFLinker into lld.

Yeah, that’s the direction lld folks have pushed for - post-processing rather than link-time, mostly due to the current performance of DWARF-aware linking being quite slow; the thinking is that not many users would be willing to take that link-time performance hit to use the feature. (Whereas as a post-processing step before archiving DWARF (like building a dwp from dwo files) it might be more appealing/interesting - and maybe with sufficient performance improvements, it could then be rolled into lld as originally proposed.)

Curiously, Alexey’s needs include not wanting to use fission, because a single debuggable binary simplifies his users’ use case/makes it easier to distribute than two files. So he’s probably not interested in the strip-debug/only-keep-debug kind of debug info distribution model, at least for his own users/use case. So far as I understand it.

I’ve got mixed feelings about that - and encourage you to express/clarify/discuss your thoughts here, as I think the whole conversation could use some more voices.

  • Dave

A quick note: The feature as currently proposed sounds like it’s an exact match for ‘dwz’? Is there any benefit to this over the existing dwz project? Is it different in some ways I’m not aware of? (I haven’t actually used dwz, so I might have some mistaken ideas about how it should work)

If it’s going to solve the same general problem, but be in the llvm project instead, then maybe it should be called llvm-dwz.

Though I understand the desire for this to grow other functionality, like DWARF-aware dwp-ing. Might be better for this to be a busybox-style tool and provide that functionality under llvm-dwp instead; or, more likely I suspect, the existing llvm-dwp will be rewritten (probably by me) to use more of lld’s infrastructure to be more efficient (its current object reading/writing logic uses LLVM’s libObject and MCStreamer, which is a bit inefficient for a very content-unaware linking process), and then maybe that could be taught to use DWARFLinker as a library to optionally do DWARF-aware linking, depending on the user’s time/space tradeoff desires. It would still benefit from any improvements to the underlying DWARFLinker library (at which point that library would be shared between llvm-dsymutil, llvm-dwz, and llvm-dwp).

What I meant is that lld should emit the same files you’d get via objcopy --strip-debug; objcopy --only-keep-debug; objcopy --add-gnu-debuglink (or eu-strip -f foo.debug foo). The only difference is that they’re directly output from the linker, instead of via a post-processing step. Could be invoked like ld.lld -o foo -s --debug-output=foo.debug, or with -S instead, if you want to keep the symtab in the binary rather than in the debuginfo.

The original GNU proposal for the new object type flag in that thread was just a tiny modification of the existing formats, to enable identifying a debuginfo file. We can easily implement that extra flag, if it happens. It’s not clear to me that introducing some other new behavior here would be particularly interesting or useful – even having seen that thread.

It looks like dwz and llvm-dwarfutil are not exactly matched in functionality. dwz is a program that attempts to optimize the DWARF debugging information contained in ELF shared libraries and ELF executables for size. llvm-dwarfutil is a tool for processing debug info (DWARF) located in built binary files, to improve debug info quality, reduce debug info size, and accelerate debug info processing.

Things which are supposed to be done by llvm-dwarfutil and which are not done by dwz: removing obsolete debug info, building indexes, stripping unneeded debug sections, compressing/decompressing debug sections.

The common thing is that both of these tools reduce debug info size, but they do this using different approaches:

  1. dwz reduces the size of debug info by creating partial compilation units for duplicated parts, so that these partial compilation units can be imported in every duplicated place. AFAIU, that optimization gives the most size-saving effect. Another size-saving optimization is ODR type deduplication.

  2. llvm-dwarfutil reduces the size of debug info by ODR type deduplication, which gives the most size-saving effect in llvm-dwarfutil’s case. Another size-saving optimization is removing obsolete debug info (which is actually not only about size but about correctness as well).

So, it looks like these tools are not equal. If we consider llvm-dwz to be an extension of classic dwz, then we could probably name it llvm-dwz.
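As a rough illustration of the ODR-based approach (a toy Python model, nothing like DWARFLinker's actual data structures): since the C++ ODR guarantees that a type with a given fully-qualified name has one definition, the linker can keep the first definition it sees and turn later duplicates into references to it.

```python
# Toy model of ODR type deduplication: keep the first definition of each
# fully-qualified type name, and redirect later duplicates to it.
def dedup_types(compile_units):
    canonical = {}          # fully-qualified name -> (cu_index, name)
    kept, redirected = [], {}
    for cu_index, types in enumerate(compile_units):
        for name in types:
            if name in canonical:
                # duplicate: becomes a reference to the canonical copy
                redirected[(cu_index, name)] = canonical[name]
            else:
                canonical[name] = (cu_index, name)
                kept.append((cu_index, name))
    return kept, redirected

cus = [["std::string", "Foo"], ["std::string", "Bar"], ["Foo", "Bar"]]
kept, redirected = dedup_types(cus)
# 3 unique definitions survive; 3 duplicates become references.
```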

Fair enough - thanks for clarifying the differences! (I’d still lean a bit towards this being dwz-esque - as you say, “an extension of classic dwz” using a bit more domain knowledge (of terminators and the C++ ODR) - though I’m not sure dsymutil does rely on the ODR, does it? It relies on it to know that two names represent the same type, I suppose, but doesn’t assume the definitions are already identical; instead it merges their members.)

But I don’t have super strong feelings about the naming.

Hi James,

Thank you for the comments.

I think we’re not terribly far from that ideal, now, for ELF. Maybe only these three things need to be done? --

  1. Teach lld how to emit a separated debuginfo output file directly, without requiring an objcopy step.
  2. Integrate DWARFLinker into lld.
  3. Create a new tool which takes the separated debuginfo and DWO/DWP files and uses DWARFLinker library
    to create a new (dwarf-linked) separated-debug file, that doesn’t depend on DWO/DWP files.

The three goals which you've described are our long-term goals.
Indeed, the best solution would be to create valid, optimized debug info without additional
stages and additional modifications of the resulting binaries.

There was an attempt to use DWARFLinker from lld - https://reviews.llvm.org/D74169.
It did not receive enough support to be integrated yet. There are fair reasons for that:

  1. Execution time. The time required by DWARFLinker to process the clang binary is 8x
    the usual linking time: linking the clang binary with DWARFLinker takes 72 sec, while
    linking with lld alone takes 9 sec.

  2. “Removing obsolete debug info” cannot currently be switched off. Thus, lld could not use
    DWARFLinker for other tasks (like generating the index tables .gdb_index and .debug_names)
    without significant performance degradation.

  3. DWARFLinker does not support split dwarf at the moment.

None of these reasons is a blocker, and I believe the implementation from D74169 could be
integrated and incrementally improved if there were agreement on that.

Those do sound like absolutely critical issues for deploying this for real – whether as a separate tool or integrated with lld. But possibly not critical enough to prevent adding this behind an experimental flag, and working on the code incrementally in-tree. However (without having looked at the code in question), I wonder if the reported 8x regression in link-time is even going to be salvageable just by incremental optimizations, or if it might require a complete re-architecting of the DwarfLinker code.

Using DWARFLinker from llvm-dwarfutil is another way to exercise and improve it.

When finally implemented, llvm-dwarfutil should solve the above three issues, and then there
would probably be more reason to include DWARFLinker in lld.

Is it the case that if the code is built to support the “read an executable, output a new better executable” use-case, it will actually be what’s needed for the “output an optimized executable while linking object files” use-case? I worry that those could have enough different requirements that you really need to be developing the linker-integrated version from the very beginning in order to get a good result, rather than trying to shoehorn it in as an afterthought.

Even if we had the best solution, it would still be useful to have a tool like llvm-dwarfutil
for cases where it is necessary to process already-built binaries.

Sure – I just think that should be considered as a secondary use-case, and not the primary goal.

Hi James,

Thank you for the comments.

I think we’re not terribly far from that ideal, now, for ELF. Maybe only these three things need to be done? --

  1. Teach lld how to emit a separated debuginfo output file directly, without requiring an objcopy step.
  2. Integrate DWARFLinker into lld.
  3. Create a new tool which takes the separated debuginfo and DWO/DWP files and uses DWARFLinker library
    to create a new (dwarf-linked) separated-debug file, that doesn’t depend on DWO/DWP files.

The three goals which you've described are our long-term goals.
Indeed, the best solution would be to create valid, optimized debug info without additional
stages and additional modifications of the resulting binaries.

There was an attempt to use DWARFLinker from the lld - https://reviews.llvm.org/D74169
It did not receive enough support to be integrated yet. There are fair reasons for that:

  1. Execution time. The time required by DWARFLinker to process the clang binary is 8x
    the usual linking time: linking the clang binary with DWARFLinker takes 72 sec, while
    linking with lld alone takes 9 sec.

  2. “Removing obsolete debug info” cannot currently be switched off. Thus, lld could not use
    DWARFLinker for other tasks (like generating the index tables .gdb_index and .debug_names)
    without significant performance degradation.

  3. DWARFLinker does not support split dwarf at the moment.

None of these reasons is a blocker, and I believe the implementation from D74169 could be
integrated and incrementally improved if there were agreement on that.

Those do sound like absolutely critical issues for deploying this for real – whether as a separate tool or integrated with lld. But possibly not critical enough to prevent adding this behind an experimental flag, and working on the code incrementally in-tree.

Yep, that’s my feeling too.

However (without having looked at the code in question), I wonder if the reported 8x regression in link-time is even going to be salvageable just by incremental optimizations, or if it might require a complete re-architecting of the DwarfLinker code.

Jonas, who has looked at llvm-dsymutil performance for its own sake (motivated to improve llvm-dsymutil runtime, etc.), has mentioned on this and related threads that there might be minimal headroom to improve things there. So, yes, if there are greater opportunities it may require a fairly large/broad investment (though a second/third set of eyes on the current code, to see if there are some hidden opportunities, isn’t a bad thing).

Using DWARFLinker from llvm-dwarfutil is another way to exercise and improve it.

When finally implemented, llvm-dwarfutil should solve the above three issues, and then there
would probably be more reason to include DWARFLinker in lld.

Is it the case that if the code is built to support the “read an executable, output a new better executable” use-case, it will actually be what’s needed for the “output an optimized executable while linking object files” use-case? I worry that those could have enough different requirements that you really need to be developing the linker-integrated version from the very beginning in order to get a good result, rather than trying to shoehorn it in as an afterthought.

Fair concern. I think there’s probably a good chance of a lot of overlap in functionality/benefits - but, yes, likely some unspecified amount that would be context-dependent/different between lld/dwz/dwp/dsymutil use cases that are all slightly different.

That would be more like a complete re-architecting of the DWARFLinker code. The current dsymutil implementation parallelizes across the “analyzing” and “cloning” stages, i.e. it sequentially analyzes all object files and sequentially clones them, which gives a parallelization speed-up of about 2x. Changing this to process compilation units in parallel might speed up execution time further. Supporting that scenario would require huge refactoring; the advantages would be faster execution and reduced memory usage. I am planning to make a prototype to prove that such a refactoring will have these benefits.

There already exist working prototypes of the “read an executable, output a new better executable” use case (D86539) and of the “output an optimized executable while linking object files” use case (D74169). They share most of the DWARFLinker code. Differences exist, but they could be managed.

The major problem of D86539 is that it loads all DIEs into memory (since we have only one source file). For clang this requires approx. 30G of memory. D74169 does not have this problem, since it loads/frees DIEs per object file. Changing processing from “per source file” to “per compilation unit” should help with this problem.

So it looks like a DWARFLinker refactored to the parallel “per compilation unit” scenario fits quite well for both of these tasks. At the current moment I do not see other problems which would prevent using the same DWARFLinker library for both of them. There could be some unseen ones, of course.
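The “per compilation unit” scenario described above can be sketched as follows (toy Python, with hypothetical names, not the real DWARFLinker interfaces): each unit is analyzed for live entries and cloned independently of the others, so units can run in parallel and be released as soon as they are emitted, instead of holding every DIE of the whole file in memory.

```python
# Toy per-compilation-unit pipeline: analyze + clone each unit
# independently, so units can be processed in parallel.
from concurrent.futures import ThreadPoolExecutor

def analyze(unit):
    # stand-in for liveness analysis: keep only the live entries
    return [d for d in unit["dies"] if d["live"]]

def clone(live_dies):
    # stand-in for cloning: emit copies of the surviving entries
    return [dict(d, cloned=True) for d in live_dies]

def link_unit(unit):
    return clone(analyze(unit))

def link(units, jobs=4):
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(link_unit, units))

units = [
    {"dies": [{"name": "f", "live": True}, {"name": "dead", "live": False}]},
    {"dies": [{"name": "g", "live": True}]},
]
out = link(units)
# Each output unit holds only its live, cloned entries.
```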

I have a slight doubt about “llvm-dwz”, since it might confuse people who would expect exactly the same behavior. But if we think of it as “an extension of classic dwz” and the possible confusion is not a big deal, then I would be fine with “llvm-dwz”.