Remove obsolete debug info while garbage collecting

Debuginfo and linker folks, we (AccessSoftek) would like to suggest a proposal for removing obsolete debug info. If you find it useful we will be happy to work on improving it. Thank you for any opinions and suggestions.

Alexey.

Currently, when the linker does garbage collection, a lot of abandoned debug info is left behind (see Appendix A for documentation). Besides the inflated debug info size, we end up with overlapping address ranges and no way to tell valid ranges from garbage ones. We propose removing debug info along with the removed code. This would reduce debug info size and improve debug info accuracy.

There are several approaches which could be used to solve that problem:

  1. Require DWARF producers to generate fragmented debug data according to the DWARF5 specification: “E.3.3 Single-function-per-DWARF-compilation-unit”, page 388. That approach assumes fragmenting the whole debug info on a per-function basis and gluing the fragmented sections together at link time using section groups.

  2. Use an additional tool which would optimize out unnecessary debug data, similar to dwz (DWARF compressor tool) or dsymutil (links the DWARF debug information). This approach assumes additional post-link processing of binaries.

  3. Teach the linker to parse debug data and let it remove unused debug data.

In this proposal, we focus on approach #3. We show that this approach is viable and discuss some preliminary results, leaving the particular implementation out of scope. We attach the Proof of Concept (PoC) implementation (https://reviews.llvm.org/D67469) for illustrative purposes. Please keep in mind that it is not final and there is room for improvement (see Appendix B). However, the results look quite promising: they demonstrate up to a 2x size reduction, with a performance overhead of 30% of linking time (which is in the same ballpark as the existing section compression; see Table 2, point F).

A straightforward implementation would fully parse DWARF, create an in-memory hierarchy of DWARF objects, optimize them, and generate new section contents. It would therefore require too much memory and take noticeable time. Instead, the proposed solution is a combination of “fragmented DWARF” (#1 above) and “Optimise parsed DWARF at link stage” (#3 above). However, there is no preliminary DWARF data fragmentation step. Instead, the data is parsed at link time, and then the pieces that correspond to live debug data are copied into the resulting sections. Essentially, the patch skips the debug info (subprograms, address ranges, line sequences) that corresponds to the dead sections.
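
To illustrate the "copy only the live pieces" idea, here is a minimal sketch. It is not taken from the PoC patch: the DeadRange type and both function names are hypothetical, and a section is modelled as a plain byte string. It assumes the dead pieces have already been identified as sorted, non-overlapping byte ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: a byte range [begin, end) of an input debug section that
// describes discarded code and should not be copied to the output.
struct DeadRange {
  std::size_t begin;
  std::size_t end;
};

// Copies every byte of `input` that is not covered by a dead range.
// `dead` must be sorted by `begin` and must not overlap.
std::string copyLivePieces(const std::string &input,
                           const std::vector<DeadRange> &dead) {
  std::string output;
  std::size_t pos = 0;
  for (const DeadRange &r : dead) {
    assert(r.begin >= pos && r.begin <= r.end && r.end <= input.size());
    output.append(input, pos, r.begin - pos); // live piece before the hole
    pos = r.end;                              // skip the dead piece
  }
  output.append(input, pos, std::string::npos); // trailing live piece
  return output;
}

// Maps an offset in the input section to the matching offset in the output.
// The offset is assumed to point at live data.
std::size_t remapOffset(std::size_t offset,
                        const std::vector<DeadRange> &dead) {
  std::size_t removed = 0;
  for (const DeadRange &r : dead) {
    if (r.end > offset)
      break;
    removed += r.end - r.begin;
  }
  return offset - removed;
}
```

Any reference that survives into the output would have to go through something like remapOffset, which is why the number of references to follow matters for link time.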

Two command-line options are added to lld:

  1. --gc-debuginfo removes pieces of debug information related to the discarded sections.

  2. --gc-debuginfo-types does alternative type deduplication while doing --gc-debuginfo.

For the purpose of simplicity, some shortcuts were used in this PoC implementation:

  1. Same types always use the same abbreviations. A full implementation should take different abbreviations into account.
  2. Split DWARF is not supported.
  3. Only the .debug_abbrev, .debug_info, .debug_ranges, .debug_rnglists, and .debug_line tables are processed.
  4. DWARF64 is not supported.

We also note that the proposed approach is quite universal and could be used for other debug info optimization tasks. For example, there is an alternative solution for type deduplication (other than using COMDAT sections to keep types (-fdebug-types-section)): parse DWARF, cut out duplicated types, and patch type references to point to the single type definition. That is, it uses the same approach as deleting unused debug info: cutting out unneeded debug section content. This alternative implementation need not replace -fdebug-types-section, but it shows that the approach could be used for type deduplication as well. This solution (combined with the global type table, which is not implemented by this patch) has some advantages, though: it could reduce the number of references inside the .debug_info section, and it could reduce the size of the type information by deduplicating base and DW_FORM_ref_sig8 types.
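
The "cut out duplicated types, patch type references" step can be sketched roughly as follows. This is an illustration only, not the PoC code: TypeDef, the signature field (a stand-in for whatever identifies equal types) and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified model of a type definition found while parsing.
struct TypeDef {
  std::uint64_t signature; // stand-in for a hash of the type's contents
  std::uint64_t offset;    // where this copy of the type lives
};

using OffsetMap = std::unordered_map<std::uint64_t, std::uint64_t>;

// Maps the offset of every type copy to the offset of the first copy with
// the same signature (the one that would be kept).
OffsetMap buildCanonicalMap(const std::vector<TypeDef> &types) {
  OffsetMap firstBySignature;
  OffsetMap canonical;
  for (const TypeDef &t : types) {
    // emplace() keeps the existing entry if the signature was seen before.
    auto it = firstBySignature.emplace(t.signature, t.offset).first;
    canonical[t.offset] = it->second;
  }
  return canonical;
}

// Rewrites each type reference so that it points at the kept definition.
void patchTypeRefs(std::vector<std::uint64_t> &refs,
                   const OffsetMap &canonical) {
  for (std::uint64_t &ref : refs) {
    auto it = canonical.find(ref);
    assert(it != canonical.end() && "reference to an unknown type");
    ref = it->second;
  }
}
```

After patching, the duplicate copies are unreferenced and can be dropped in the same way as debug info for dead sections.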

Several changes, if adopted by the DWARF standard, would help this implementation work better:

  1. Minimize or entirely avoid references from subprograms into other parts of the .debug_info section. That would simplify splitting out and removing subprograms, in the sense that it would minimize the number of references that have to be parsed and followed. (DW_FORM_ref_subroutine instead of DW_FORM_ref_*, ?)

  2. Create an additional section - global types table (.debug_types_table). That would significantly reduce the number of references inside the .debug_info section. It also makes it possible to have a 4-byte reference into this section instead of an 8-byte reference to a type unit (DW_FORM_ref_types instead of DW_FORM_ref_sig8). It also makes it possible to place base types into this section and avoid their per-compile-unit duplication. Additionally, a size reduction could be achieved by not generating type unit headers. Note that the new section - .debug_types_table - differs from the DWARF4 section .debug_types in that it contains unique type descriptors referenced by offsets instead of a list of type units referenced by DW_FORM_ref_sig8; all table entries share the same abbreviations and do not have type unit headers.

  3. Define a limited scope for line programs which could be removed independently. Currently the .debug_line section contains a program in a byte-coded language for a state machine. That program actually represents a matrix [instruction][line information]. In general, it is hard to cut out part of that program and keep the whole program correct. Thus it would be good to specify separate scopes (related to address ranges) which could be easily removed from the program body.
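
For the line table item, the intended effect can be sketched on the decoded matrix rather than on the byte-coded program. This is a simplified illustration under stated assumptions: LineRow, AddressRange and keepLiveRows are hypothetical names, and a real implementation would also have to re-encode the program and handle sequence boundaries.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical decoded row of the line table matrix.
struct LineRow {
  std::uint64_t address;
  unsigned line;
};

// Hypothetical live address range [low, high) reported by the linker.
struct AddressRange {
  std::uint64_t low;
  std::uint64_t high;
};

// Keeps only the rows whose address falls into some live range.
std::vector<LineRow> keepLiveRows(const std::vector<LineRow> &rows,
                                  const std::vector<AddressRange> &live) {
  std::vector<LineRow> kept;
  for (const LineRow &row : rows) {
    for (const AddressRange &r : live) {
      assert(r.low <= r.high);
      if (row.address >= r.low && row.address < r.high) {
        kept.push_back(row);
        break; // one matching range is enough
      }
    }
  }
  return kept;
}
```

Doing the same thing directly on the encoded program is the hard part, which is what separately removable scopes would address.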

We evaluated the approach on LLVM and Clang codebases. The results obtained are summarized in the tables below:

Abbreviations:

LLVM bin size - size of llvm build/bin directory.
LLVM build time - compilation time for building llvm.
Clang size - size of clang binary.
link time - time for linking clang binary.
Errors - number of errors reported by llvm-dwarfdump --verify for clang binary.
gc-dbginfo - linker option added by this patch. Spelled as “-gc-debuginfo”.
gc-dbgtypes - linker option added by this patch. Spelled as “-gc-debuginfo-types”

Table 1. LLVM codebase.


Have you considered/tried reusing the DWARF minimization/deduplication/linking logic that’s already in llvm’s dsymutil implementation? If we’re going to do that, having a singular implementation would be desirable.

(bonus points if we could do something like the dsymutil approach when using Split DWARF and building a DWP - taking some address table output from the linker, and using that to help trim things (or, even when having no input from the linker - at least doing more aggressive deduplication during DWP construction than can be currently done with only type units (& potentially removing/avoiding type unit overhead too)))


  1. Minimize or entirely avoid references from subprograms into other parts of the .debug_info section. That would simplify splitting out and removing subprograms, in the sense that it would minimize the number of references that have to be parsed and followed. (DW_FORM_ref_subroutine instead of DW_FORM_ref_*, ?)

Not sure I follow - by “other parts of the .debug_info section” do you mean in the same CU, or cross CU references? Any particular references you have in mind? Or encountered in practice?

  2. Create an additional section - global types table (.debug_types_table). That would significantly reduce the number of references inside the .debug_info section. It also makes it possible to have a 4-byte reference into this section instead of an 8-byte reference to a type unit (DW_FORM_ref_types instead of DW_FORM_ref_sig8). It also makes it possible to place base types into this section and avoid their per-compile-unit duplication. Additionally, a size reduction could be achieved by not generating type unit headers. Note that the new section - .debug_types_table - differs from the DWARF4 section .debug_types in that it contains unique type descriptors referenced by offsets instead of a list of type units referenced by DW_FORM_ref_sig8; all table entries share the same abbreviations and do not have type unit headers.

What do you mean when you say “global types table”? The phrasing in the above paragraph is present-tense, as though this thing exists, but it doesn’t seem to describe what it actually is and how it achieves the things the text says it achieves. Perhaps I’ve missed some context here.

  3. Define a limited scope for line programs which could be removed independently. Currently the .debug_line section contains a program in a byte-coded language for a state machine. That program actually represents a matrix [instruction][line information]. In general, it is hard to cut out part of that program and keep the whole program correct. Thus it would be good to specify separate scopes (related to address ranges) which could be easily removed from the program body.

In my experience line tables are /tiny/ - have you prototyped any change in this space to have a sense of whether it would have significant savings? (it’d potentially help address the address ambiguity issues when the linker discards code, though - so might be a correctness issue rather than a size performance issue)

We evaluated the approach on LLVM and Clang codebases. The results obtained are summarized in the tables below:

Memory usage statistics (& confidence intervals for the build time) would probably be especially useful for comparing these tradeoffs.
Doubly so when using compression (since the decompression would need to use more memory, as would the recompression - so, two different tradeoffs (compressed input, compressed output, and then both at the same time))


On 17.09.2019 3:12, David Blaikie wrote:


Generally speaking, dsymutil does a very similar thing. It parses DWARF DIEs, analyzes relocations, scans through references and throws out unused DIEs. But its current interface does not allow it to be used at the link stage. I think it would be ideal to have a single implementation. Though I did not analyze how easy, or whether it is even possible, to reuse its code at the link stage, it looked like it would need significant rework.

The implementation from this proposal removes obsolete debug info at the link stage. It therefore benefits from already loaded object files, already created liveness information, and generating an optimized binary from scratch.

If dsymutil could be refactored in such a manner that it could be used at the link stage, then its implementation could be reused. I will research the possibility of such a refactoring.

Yeah, if this is going to be implemented, I think that would be strongly preferred - though I realize it may be substantial work to refactor. The alternative - duplicating all this work - doesn’t seem like something that would be good for the LLVM project.

Not sure I follow - by “other parts of the .debug_info section” do you mean in the same CU, or cross CU references? Any particular references you have in mind? Or encountered in practice?

I mean here all kinds of references into .debug_info section.

Ah, not only references from other places /into/ .debug_info (which don’t really exist, so far as I know) but any references to locations within debug_info.

Reducing these isn’t super-viable - types being the most common examples. Though now I understand what you’re getting at partly around the debug_type_table idea - adding a level of indirection to type references. So it’d be easy to find only one place to fix when removing chunks of debug_info (updating only the type table without having to find all the places inside debug_info to touch). That indirection would come at a size cost, of course - and an overhead for DWARF parsers having to follow that indirection. Doesn’t make it impossible - just tradeoffs to be aware of.

Though those aren’t the only DIE references - without removing them all there’d still be a fair bit of overhead for finding any remaining ones and applying them. If an indirection table is to be added, maybe a generalized one (for any DIE reference) rather than one only for types would be good.
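
A rough sketch of what such a generalized indirection table could look like, purely as an illustration of the tradeoff discussed here. Nothing like this exists in DWARF today; IndirectionTable and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical indirection table: DIE references store an index into
// `offsets` instead of a raw debug_info offset.
struct IndirectionTable {
  std::vector<std::uint64_t> offsets;

  // The extra lookup every DWARF consumer would have to perform.
  std::uint64_t resolve(std::size_t index) const {
    assert(index < offsets.size());
    return offsets[index];
  }

  // Called after `size` bytes starting at `begin` were removed from
  // debug_info: entries past the hole move down, and the DIEs that
  // reference them by index need no changes.
  void removeChunk(std::uint64_t begin, std::uint64_t size) {
    for (std::uint64_t &off : offsets) {
      assert((off < begin || off >= begin + size) &&
             "entry points into the removed chunk");
      if (off >= begin + size)
        off -= size;
    }
  }
};
```

Only the table is touched when a chunk is removed, which is the benefit; the cost is the table itself plus one more lookup per reference.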

(aspects of this have been discussed before - we’ve sometimes nicknamed it “bag of DWARF” when discussing it in the context of type units (currently you can only reference the type DIE in a type unit - which adds overhead when wanting to reference subprogram declaration DIEs, etc (or maybe multiple types are clustered together and don’t need a separate type unit each - if only you could refer to multiple types in a type unit)) - so we’ve discussed generalizing the type unit header (actually it could generalize even as far as the classic CU header) to have N type DIE offset+hash pairs (zero for a normal CU, one for a classic type unit, and any number for more interesting cases))

Going through references is a time-consuming task. Thus, the fewer references that have to be followed, the faster it works.

Cross-CU references require loading the referenced CU. I do not know of use cases where cross-CU references are used.

Cross-CU inlining due to LTO. Try something like this:

a.cpp:
  void f2();
  __attribute__((always_inline)) void f1() {
    f2();
  }

b.cpp:
  void f1();
  int main() {
    f1();
  }

$ clang++ a.cpp b.cpp -emit-llvm -S -c -g
$ llvm-link a.ll b.ll -o ab.bc
$ clang++ ab.bc -c
$ llvm-dwarfdump ab.o -v -debug-info |
0x0b: DW_TAG_compile_unit
        DW_AT_name "a.cpp"
0x2a:   DW_TAG_subprogram
          DW_AT_abstract_origin [DW_FORM_ref4] (cu + 0x0056 => {0x00000056} "_Z2f1v")
        DW_TAG_subprogram
          DW_AT_name "f1"
0x6e: DW_TAG_compile_unit
        DW_AT_name "b.cpp"
0x8d:   DW_TAG_subprogram
          DW_AT_name "main"
0xa6:     DW_TAG_inlined_subroutine
            DW_AT_abstract_origin [DW_FORM_ref_addr] (0x0000000000000056 "_Z2f1v")


Notice that the inlined_subroutine’s abstract_origin uses a linker relocation into the debug_info section to give an absolute offset within the finally linked debug_info section (since the debugger wouldn’t know that these two compile_units are bound together and to use some particular compile_unit as the base offset - either it’s absolute across the whole debug_info section (FORM_ref_addr) or it’s local to the CU (FORM_refN (such as FORM_ref4 above)))
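
The difference between the two forms can be written down as a small sketch. The enum and function names are hypothetical; the rule itself follows the DWARF definition of the forms (a ref4 value is relative to the start of the containing CU, a ref_addr value is relative to the start of the debug_info section).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of the DWARF reference forms discussed above.
enum class RefForm { Ref4, RefAddr };

// Returns the absolute debug_info offset a reference points at.
// `cuOffset` is the offset of the header of the CU containing the reference.
std::uint64_t resolveDieOffset(RefForm form, std::uint64_t value,
                               std::uint64_t cuOffset) {
  switch (form) {
  case RefForm::Ref4:
    return cuOffset + value; // CU-local: add the CU base
  case RefForm::RefAddr:
    return value;            // section-absolute: CU base is irrelevant
  }
  assert(false && "unknown reference form");
  return 0;
}
```

In the dump above, the a.cpp CU starts at offset 0, so the ref4 value 0x56 and the ref_addr value 0x56 name the same DIE. When bytes are removed from debug_info, a ref4 value only changes if bytes between the CU header and the target are removed, while a ref_addr value changes if anything before the target is removed.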

If that is a special case that is not usually used inside subprograms, then it is probably possible to avoid it.

It’s fairly specifically used inside subprograms (& would need to be adjusted even if it wasn’t inside a subprogram - when bytes are removed, etc) - though possibly general relocation handling in the linker could be used to implement handling ref_addr.

For the same CU - there could probably be cases when references could be ignored: https://reviews.llvm.org/P8165

How would references be ignored while keeping them correct? Ah, by making subprograms more self-contained - maybe, but the work to figure out which things are only referenced from one place and structure the DWARF differently probably wouldn’t be ideal in the compiler & wouldn’t save the debug info linker from having to have code to handle the case where it wasn’t only used from that subprogram anyway.

What do you mean when you say “global types table” the phrasing in the above paragraph is present-tense, as though this thing exists but doesn’t seem to describe what it actually is and how it achieves the things the text says it achieves. Perhaps I’ve missed some context here.

The “global types table” does not exist yet. It could be created if the discussed approach is considered useful.

Ah, the present-tense language was a bit confusing for me when discussing a thing that doesn’t exist yet & not having provided a description of what it might be or might contain and why it would exist/what it would achieve.

Please check the comparison of a possible “global types table” and the existing type units: https://reviews.llvm.org/P8164

Ah, that proposed version makes it easy to remove subprograms from debug_info without having to fix up type references (but you still have to have the code to fix up other cross-CU references, like abstract_origin, so I’m not sure it provides that much value) but doesn’t make it easy to remove types (because you’d have to go looking through the debug_info section to update all the type offsets (which I guess you have to do anyway to find the type references)), and removing the types still also requires fixing up the types that reference each other…

So I’m not seeing a big win there.

The benefit of using a “global types table” is that it saves the space required to keep types, compared with the type units solution.

In my experience line tables are /tiny/ - have you prototyped any change in this space to have a sense of whether it would have significant savings? (it’d potentially help address the address ambiguity issues when the linker discards code, though - so might be a correctness issue rather than a size performance issue)

I did not measure the size reduction for the line table, though I think it would be small. The more important thing is the correctness issue: the line table could contain information for overlapping address ranges.

There is another attempt to fix that issue: https://reviews.llvm.org/D59553.

Yep. It’s a complicated problem, and fixing the line table would be a good way to deal with some of it. (Split DWARF makes it hard to fix up the rest of the debug info, though - so there would still be some ambiguity in the DWARF with a binary using Split DWARF).

We evaluated the approach on LLVM and Clang codebases. The results obtained are summarized in the tables below:

Memory usage statistics (& confidence intervals for the build time) would probably be especially useful for comparing these tradeoffs.
Doubly so when using compression (since the decompression would need to use more memory, as would the recompression - so, two different tradeoffs (compressed input, compressed output, and then both at the same time))

I will measure the memory impact of the PoC implementation, but I expect it to be significant. Memory usage has not been optimized yet. There are several things which might be done to reduce the memory footprint: do not load all compile units into memory, and avoid adding a Parent field to all DIEs.

Yep, this is the sort of thing where I suspect the dsymutil implementation may’ve already had at least some of that work done - or, if not, that doing the work once for both/all implementations would be very preferable to duplicating the effort.

  • Dave

On 19.09.2019 4:24, David Blaikie wrote:


Yeah, if this is going to be implemented, I think that would be strongly preferred - though I realize it may be substantial work to refactor. The alternative - duplicating all this work - doesn't seem like something that would be good for the LLVM project.

I see. I will research whether it is possible to refactor it accordingly.


Though those aren't the only DIE references - without removing them all there'd still be a fair bit of overhead for finding any remaining ones and applying them. If an indirection table is to be added, maybe a generalized one (for any DIE reference) rather than one only for types would be good.

Yes, some general indirection table would probably be useful. But types would still require specialized handling: types have a "type hash" and need some specific logic around that.

(aspects of this have been discussed before - we've sometimes nicknamed it "bag of DWARF" when discussing it in the context of type units (currently you can only reference the type DIE in a type unit - which adds overhead when wanting to reference subprogram declaration DIEs, etc (or maybe multiple types are clustered together and don't need a separate type unit each - if only you could refer to multiple types in a type unit)) - so we've discussed generalizing the type unit header (actually it could generalize even as far as the classic CU header) to have N type DIE offset+hash pairs (zero for a normal CU, one for a classic type unit, and any number for more interesting cases))

As far as I understand, "generalizing the type unit header (actually it could generalize even as far as the classic CU header) to have N type DIE offset+hash pairs" looks very close to "global type table" which I am talking about.

    Going through references is the time-consuming task.
    Thus the fewer references there should be followed then the faster
    it works.

    For the cross CU references - It requires to load referenced CU. I
    do not know use cases where cross CU references are used.

Cross-CU inlining due to LTO. Try something like this:

a.cpp:
void f2();
__attribute__((always_inline)) void f1() {
f2();
}

b.cpp:
void f1();
int main() {
f1();
}

$ clang++ a.cpp b.cpp -emit-llvm -S -c -g
$ llvm-link a.ll b.ll -o ab.bc
$ clang++ ab.bc -c
$ llvm-dwarfdump ab.o -v -debug-info |
0x0b: DW_TAG_compile_unit
DW_AT_name "a.cpp"
0x2a: DW_TAG_subprogram
DW_AT_abstract_origin [DW_FORM_ref4] (cu + 0x0056 => {0x00000056} "_Z2f1v")
DW_TAG_subprogram
DW_AT_name "f1"
0x6e: DW_TAG_compile_unit
DW_AT_name "b.cpp"
0x8d: DW_TAG_subprogram
DW_AT_name "main"
0xa6: DW_TAG_inlined_subroutine
DW_AT_abstract_origin [DW_FORM_ref_addr] (0x0000000000000056 "_Z2f1v")

Notice that the inlined_subroutine's abstract_origin uses a linker relocation into the debug_info section to give an absolute offset within the finally linked debug_info section (since the debugger wouldn't know that these two compile_units are bound together and to use some particular compile_unit as the base offset - either it's absolute across the whole debug_info section (FORM_ref_addr) or it's local to the CU (FORM_refN (such as FORM_ref4 above)))
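
The difference between the two forms can be sketched in a few lines. This is an illustration only: resolve_die_ref is a hypothetical helper, not an LLVM API. The start of the b.cpp unit (0x63) is an assumption, obtained by subtracting the 11-byte unit header seen on the first unit in the dump (its compile_unit DIE sits at 0x0b) from 0x6e.

```python
# Hypothetical helper (not an LLVM API): turn a DIE reference attribute into
# an absolute offset within the linked .debug_info section.
CU_LOCAL_FORMS = {"DW_FORM_ref1", "DW_FORM_ref2", "DW_FORM_ref4",
                  "DW_FORM_ref8", "DW_FORM_ref_udata"}


def resolve_die_ref(form, value, cu_offset):
    if form in CU_LOCAL_FORMS:
        # CU-local forms are relative to the start of the containing unit.
        return cu_offset + value
    if form == "DW_FORM_ref_addr":
        # Already absolute; in an object file this value is what the linker
        # relocation fills in.
        return value
    raise ValueError(f"not a DIE reference form: {form}")


A_CPP_CU = 0x00  # from the dump: "cu + 0x0056 => {0x00000056}"
B_CPP_CU = 0x63  # assumption: 0x6e minus an 11-byte unit header

origin_from_a = resolve_die_ref("DW_FORM_ref4", 0x56, A_CPP_CU)
origin_from_b = resolve_die_ref("DW_FORM_ref_addr", 0x56, B_CPP_CU)
# If main's inlined_subroutine had used a CU-local form with the same value,
# it would point into b.cpp's own unit instead of at f1's abstract origin.
misresolved = resolve_die_ref("DW_FORM_ref4", 0x56, B_CPP_CU)
```

Both origin_from_a and origin_from_b land on the same DIE (0x56), while the CU-local reading from b.cpp does not.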

Got it. Thank you.

    If that is the specific case and is not used inside subprograms
    usually, then probably it is possible to avoid it.

It's fairly specifically used inside subprograms (& would need to be adjusted even if it wasn't inside a subprogram - when bytes are removed, etc) - though possibly general relocation handling in the linker could be used to implement handling ref_addr.
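
The adjustment needed "when bytes are removed" is plain offset arithmetic, sketched below. The names and the example ranges are hypothetical; a real linker would derive the removed ranges from its liveness information and might do the patching through its general relocation handling instead.

```python
from bisect import bisect_right
from itertools import accumulate


def make_offset_remapper(removed):
    """Build a function mapping old .debug_info offsets to new ones.

    removed: sorted, non-overlapping (start, end) byte ranges, end exclusive.
    """
    starts = [start for start, _ in removed]
    ends = [end for _, end in removed]
    # dropped[i] is the number of bytes removed by the first i ranges.
    dropped = [0] + list(accumulate(end - start for start, end in removed))

    def remap(offset):
        i = bisect_right(starts, offset)  # ranges starting at or before offset
        if i and offset < ends[i - 1]:
            return None  # the referenced DIE itself was removed
        return offset - dropped[i]

    return remap


# Hypothetical example: two dead chunks are cut out of the section.
remap = make_offset_remapper([(0x10, 0x20), (0x40, 0x50)])
```

Every surviving DW_FORM_ref_addr value (and every CU-local reference that crosses a removed range) would be passed through such a mapping.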

    For the same CU - there could probably be cases when references
    could be ignored: https://reviews.llvm.org/P8165

How would references be ignored while keeping them correct? Ah, by making subprograms more self-contained - maybe, but the work to figure out which things are only referenced from one place and structure the DWARF differently probably wouldn't be ideal in the compiler & wouldn't save the debug info linker from having to have code to handle the case where it wasn't only used from that subprogram anyway.

        2. Create additional section - global types table
        (.debug_types_table). That would significantly reduce the
        number of references inside .debug_info section. It also
        makes it possible to have a 4-byte reference in this section
        instead of 8-bytes reference into type unit
        (DW_FORM_ref_types instead of DW_FORM_ref_sig8). It also
        makes it possible to place base types into this section and
        avoid per-compile unit duplication of them. Additionally,
        there could be achieved size reduction by not generating type
        unit header. Note, that new section - .debug_types_table -
        differs from DWARF4 section .debug_types in that sense that:
        it contains unique type descriptors referenced by offsets
        instead of list of type units referenced by
        DW_FORM_ref_sig8; all table entries share the same
        abbreviations and do not have type unit headers.

    What do you mean when you say "global types table" the phrasing
    in the above paragraph is present-tense, as though this thing
    exists but doesn't seem to describe what it actually is and how
    it achieves the things the text says it achieves. Perhaps I've
    missed some context here.

    The "global types table" does not exist yet. It could be created
    if the discussed approach would be considered useful.

Ah, the present-tense language was a bit confusing for me when discussing a thing that doesn't exist yet & not having provided a description of what it might be or might contain and why it would exist/what it would achieve.

I should've written it more precisely.

    Please check the comparison of possible "global types table" and
    currently existed type units: https://reviews.llvm.org/P8164

Ah, that proposed version makes it easy to remove subprograms from debug_info without having to fix up type references (but you still have to have the code to fix up other cross-CU references, like abstract_origin, so I'm not sure it provides that much value) but doesn't make it easy to remove types (because you'd have to go looking through the debug_info section to update all the type offsets (which I guess you have to do anyway to find the type references) and removing the types still also requires fixing up the types that reference each other...)

So I'm not seeing a big win there.

Correct. Even if types were put into a separate table, it would still be necessary to:
"go looking through the debug_info section to update all the type offsets";
"removing the types still also requires fixing up the types that reference each other".

But additionally it provides the following benefits:

1. Size reduction by removing fragmentation. In the "-fdebug-types-section" solution every type which is put into a type unit requires:
- an additional type unit header,
- a section header (since it is put into a separate section),
- proxy type copies inside the compilation unit.

Putting types into a separate table avoids creating the above data for every type.

2. Size reduction by deduplicating base types. In the "-fdebug-types-section" solution base types are not deduplicated at all.

3. Performance improvement by handling less data. #1 leads to loading and parsing fewer bits.

4. Performance improvement by handling fewer references. Simpler reference chains allow parsing references faster.
Instead of this:

type_offset->proxy_type->DW_FORM_ref_sig8->type_unit->type_offset->type.

There would be this:

type_offset->type_table->type.
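
For benefit 1, the fixed part of the per-type cost follows from the format definitions and can be estimated as below. The header sizes are those of a 32-bit DWARF v4 type unit header and of an ELF64 section header entry; the proxy size and the number of proxy copies are hypothetical inputs, not measurements, and the extra cost of the section group itself is not counted.

```python
# 32-bit DWARF v4 type unit header:
# unit_length(4) + version(2) + debug_abbrev_offset(4) + address_size(1)
# + type_signature(8) + type_offset(4)
TYPE_UNIT_HEADER = 4 + 2 + 4 + 1 + 8 + 4

# One Elf64_Shdr entry for the separate section holding the type unit.
ELF64_SECTION_HEADER = 64


def type_unit_overhead(proxy_bytes=0, proxy_copies=0):
    """Bytes spent per type on packaging rather than on the type itself.

    proxy_bytes / proxy_copies are hypothetical: the size of one proxy type
    DIE and the number of compilation units that carry a copy of it.
    """
    return TYPE_UNIT_HEADER + ELF64_SECTION_HEADER + proxy_bytes * proxy_copies


fixed_cost = type_unit_overhead()
with_proxies = type_unit_overhead(proxy_bytes=12, proxy_copies=3)
```

With a single shared table, the fixed part would be paid once per table rather than once per type.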

        We evaluated the approach on LLVM and Clang codebases. The
        results obtained are summarized in the tables below:

    Memory usage statistics (& confidence intervals for the build
    time) would probably be especially useful for comparing these
    tradeoffs.
    Doubly so when using compression (since the decompression would
    need to use more memory, as would the recompression - so, two
    different tradeoffs (compressed input, compressed output, and
    then both at the same time))

    I would measure memory impact for that PoC implementation, but I
    expect it would be significant.
    Memory usage was not optimized yet. There are several things which
    might be done to reduce memory footprint:
    do not load all compile units into memory, avoid adding Parent
    field to all DIEs.

Yep, this is the sort of thing where I suspect the dsymutil implementation may've already had at least some of that work done - or, if not, that doing the work once for both/all implementations would be very preferable to duplicating the effort.

Ok,

Thank you, Alexey.

19.09.2019 4:24, David Blaikie wrote:

Generally speaking, dsymutil does a very similar thing. It parses DWARF DIEs, analyzes relocations, scans through references and throws out unused DIEs. But its current interface does not allow it to be used at the link stage.
I think it would be perfect to have a single implementation.
Though I did not analyze how easy, or even whether, it is possible to reuse its code at the link stage, it looked like it needs significant rework.

The implementation from this proposal removes obsolete debug info at the link stage.
It therefore benefits from already loaded object files, already computed liveness information,
and from generating an optimized binary from scratch.

If dsymutil could be refactored in such a manner that it could be used at the link stage, then its implementation could be reused. I would research the possibility of such a refactoring.

Yeah, if this is going to be implemented, I think that would be strongly preferred - though I realize it may be substantial work to refactor. The alternative - duplicating all this work - doesn’t seem like something that would be good for the LLVM project.

I see. So I would research the question of whether it is possible to refactor it accordingly.

  1. Minimize or entirely avoid references from subprograms into other parts of .debug_info section. That would simplify splitting and removing subprograms out in that sense that it would minimize the number of references that should be parsed and followed. (DW_FORM_ref_subroutine instead of DW_FORM_ref_*, ?)

Not sure I follow - by “other parts of the .debug_info section” do you mean in the same CU, or cross CU references? Any particular references you have in mind? Or encountered in practice?

I mean here all kinds of references into .debug_info section.

Ah, not only references from other places /into/ .debug_info (which don’t really exist, so far as I know) but any references to locations within debug_info.

Reducing these isn’t super-viable - types being the most common examples. Though now I understand what you’re getting at partly around the debug_type_table idea - adding a level of indirection to type references. So it’d be easy to find only one place to fix when removing chunks of debug_info (updating only the type table without having to find all the places inside debug_info to touch). That indirection would come at a size cost, of course - and an overhead for DWARF parsers having to follow that indirection. Doesn’t make it impossible - just tradeoffs to be aware of.

Though that’s not the only DIE references - without removing them all there’d still be a fair bit of overhead for finding any remaining ones and applying them. If an indirection table is to be added, maybe a generalized one (for any DIE reference) rather than one only for types would be good.

yes, some general indirection table would probably be useful.
But, types would still require specialized handling.
Types have “type hash” and need some specific logic around that.

This indirection is essentially the same as relocations & could be implemented that way (though no matter the solution you’d need some attribute on the CU that says “I don’t use any CU-local DIE offsets” so an implementation didn’t have to go searching/scanning for such offsets (though I guess it’d be cheap to scan for that by just looking at the abbreviations & if you don’t see any CU-local DIE offset forms, use the fast-path)). A custom DWARF format would be potentially more compact than general ELF relocations.
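
The "cheap scan of the abbreviations" mentioned above could look roughly like this. It is a sketch, not LLVM code: uses_cu_local_refs is a hypothetical name, the input is the raw abbreviation table of one unit from .debug_abbrev, and the form codes are the values from the DWARF standard.

```python
DW_FORM_indirect = 0x16
DW_FORM_implicit_const = 0x21
# DW_FORM_ref1, ref2, ref4, ref8, ref_udata: offsets relative to the CU start.
CU_LOCAL_REF_FORMS = {0x11, 0x12, 0x13, 0x14, 0x15}


def read_uleb128(data, pos):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def uses_cu_local_refs(abbrev):
    """Return True if any abbreviation declares a CU-local reference form."""
    pos = 0
    while True:
        code, pos = read_uleb128(abbrev, pos)
        if code == 0:  # end of this unit's abbreviation table
            return False
        _tag, pos = read_uleb128(abbrev, pos)
        pos += 1  # DW_CHILDREN_yes / DW_CHILDREN_no
        while True:
            attr, pos = read_uleb128(abbrev, pos)
            form, pos = read_uleb128(abbrev, pos)
            if attr == 0 and form == 0:
                break
            if form == DW_FORM_implicit_const:
                # The SLEB128 constant uses the same continuation bits, so
                # the ULEB reader skips the right number of bytes.
                _, pos = read_uleb128(abbrev, pos)
            if form in CU_LOCAL_REF_FORMS or form == DW_FORM_indirect:
                # DW_FORM_indirect hides the real form in the DIE data, so
                # be conservative and take the slow path.
                return True


# compile_unit with DW_AT_name/strp, inlined_subroutine with
# DW_AT_abstract_origin/ref_addr: no CU-local references.
fast_path_table = bytes([1, 0x11, 1, 0x03, 0x0E, 0, 0,
                         2, 0x1D, 0, 0x31, 0x10, 0, 0, 0])
# subprogram with DW_AT_abstract_origin/ref4: needs the slow path.
slow_path_table = bytes([1, 0x2E, 0, 0x31, 0x13, 0, 0, 0])
```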

(aspects of this have been discussed before - we’ve sometimes nicknamed it “bag of DWARF” when discussing it in the context of type units (currently you can only reference the type DIE in a type unit - which adds overhead when wanting to reference subprogram declaration DIEs, etc (or maybe multiple types are clustered together and don’t need a separate type unit each - if only you could refer to multiple types in a type unit)) - so we’ve discussed generalizing the type unit header (actually it could generalize even as far as the classic CU header) to have N type DIE offset+hash pairs (zero for a normal CU, one for a classic type unit, and any number for more interesting cases))

As far as I understand, “generalizing the type unit header (actually it could generalize even as far as the classic CU header) to have N type DIE offset+hash pairs” looks very close to “global type table” which I am talking about.

Going through references is the time-consuming task.
Thus the fewer references there should be followed then the faster it works.

For the cross CU references - It requires to load referenced CU. I do not know use cases where cross CU references are used.

Cross-CU inlining due to LTO. Try something like this:

a.cpp:
void f2();
__attribute__((always_inline)) void f1() {
f2();
}

b.cpp:
void f1();
int main() {
f1();
}

$ clang++ a.cpp b.cpp -emit-llvm -S -c -g
$ llvm-link a.ll b.ll -o ab.bc
$ clang++ ab.bc -c
$ llvm-dwarfdump ab.o -v -debug-info |
0x0b: DW_TAG_compile_unit
DW_AT_name “a.cpp”
0x2a: DW_TAG_subprogram
DW_AT_abstract_origin [DW_FORM_ref4] (cu + 0x0056 => {0x00000056} “_Z2f1v”)
DW_TAG_subprogram
DW_AT_name “f1”
0x6e: DW_TAG_compile_unit
DW_AT_name “b.cpp”
0x8d: DW_TAG_subprogram
DW_AT_name “main”
0xa6: DW_TAG_inlined_subroutine
DW_AT_abstract_origin [DW_FORM_ref_addr] (0x0000000000000056 “_Z2f1v”)

Notice that the inlined_subroutine’s abstract_origin uses a linker relocation into the debug_info section to give an absolute offset within the finally linked debug_info section (since the debugger wouldn’t know that these two compile_units are bound together and to use some particular compile_unit as the base offset - either it’s absolute across the whole debug_info section (FORM_ref_addr) or it’s local to the CU (FORM_refN (such as FORM_ref4 above)))

Got it. Thank you.

If that is the specific case and is not used inside subprograms usually, then probably it is possible to avoid it.

It’s fairly specifically used inside subprograms (& would need to be adjusted even if it wasn’t inside a subprogram - when bytes are removed, etc) - though possibly general relocation handling in the linker could be used to implement handling ref_addr.

For the same CU - there could probably be cases when references could be ignored: https://reviews.llvm.org/P8165

How would references be ignored while keeping them correct? Ah, by making subprograms more self-contained - maybe, but the work to figure out which things are only referenced from one place and structure the DWARF differently probably wouldn’t be ideal in the compiler & wouldn’t save the debug info linker from having to have code to handle the case where it wasn’t only used from that subprogram anyway.

  2. Create additional section - global types table (.debug_types_table). That would significantly reduce the number of references inside .debug_info section. It also makes it possible to have a 4-byte reference in this section instead of 8-bytes reference into type unit (DW_FORM_ref_types instead of DW_FORM_ref_sig8). It also makes it possible to place base types into this section and avoid per-compile unit duplication of them. Additionally, there could be achieved size reduction by not generating type unit header. Note, that new section - .debug_types_table - differs from DWARF4 section .debug_types in that sense that: it contains unique type descriptors referenced by offsets instead of list of type units referenced by DW_FORM_ref_sig8; all table entries share the same abbreviations and do not have type unit headers.

What do you mean when you say “global types table” the phrasing in the above paragraph is present-tense, as though this thing exists but doesn’t seem to describe what it actually is and how it achieves the things the text says it achieves. Perhaps I’ve missed some context here.

The “global types table” does not exist yet. It could be created if the discussed approach would be considered useful.

Ah, the present-tense language was a bit confusing for me when discussing a thing that doesn’t exist yet & not having provided a description of what it might be or might contain and why it would exist/what it would achieve.

I should’ve written it more precisely.

Please check the comparison of possible “global types table” and currently existed type units: https://reviews.llvm.org/P8164

Ah, that proposed version makes it easy to remove subprograms from debug_info without having to fix up type references (but you still have to have the code to fix up other cross-CU references, like abstract_origin, so I’m not sure it provides that much value) but doesn’t make it easy to remove types (because you’d have to go looking through the debug_info section to update all the type offsets (which I guess you have to do anyway to find the type references) and removing the types still also requires fixing up the types that reference each other…)

So I’m not seeing a big win there.

Correct. Even if types were put into a separate table, it would still be necessary to:
“go looking through the debug_info section to update all the type offsets”;
“removing the types still also requires fixing up the types that reference each other”.

But additionally it provides the following benefits:

  1. Size reduction by removing fragmentation. In the “-fdebug-types-section” solution every type which is put into a type unit requires:
  • an additional type unit header,
  • a section header (since it is put into a separate section),
  • proxy type copies inside the compilation unit.

Putting types into a separate table avoids creating the above data for every type.

  2. Size reduction by deduplicating base types. In the “-fdebug-types-section” solution base types are not deduplicated at all.

Base types are pretty small - not sure there’d be much to save by indirection (for classic base types like “int” - for non-trivial but non-user-defined types like subroutine types there might be more opportunity for savings). & you’d still have some cost of indirection to tradeoff - so I don’t think it’s always going to be the right solution to indirect everything.

There’s a lot of design considerations in this problem space, let’s put it that way.

  3. Performance improvement by handling less data. #1 leads to loading and parsing fewer bits.

  4. Performance improvement by handling fewer references. Simpler reference chains allow parsing references faster.
    Instead of this:

type_offset->proxy_type->DW_FORM_ref_sig8->type_unit->type_offset->type.

There would be this :

type_offset->type_table->type.
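
The two chains can be compared on a toy model. Everything here is hypothetical: the dictionaries stand in for parsed DIEs keyed by offset, the signature value is made up, and the type table is the proposed structure, which does not exist yet.

```python
def via_type_unit(cu_dies, type_units, type_ref):
    """type_offset -> proxy_type -> DW_FORM_ref_sig8 -> type_unit -> type."""
    proxy = cu_dies[type_ref]               # CU-local offset of the proxy DIE
    unit = type_units[proxy["signature"]]   # lookup by 8-byte type signature
    return unit["dies"][unit["type_offset"]]  # header field names the type DIE


def via_type_table(type_table, type_ref):
    """type_offset -> type_table -> type."""
    return type_table[type_ref]


SIGNATURE = 0x1122334455667788  # made-up type hash
cu_dies = {0x40: {"tag": "DW_TAG_structure_type", "declaration": True,
                  "signature": SIGNATURE}}
type_units = {SIGNATURE: {"type_offset": 0x17,
                          "dies": {0x17: {"tag": "DW_TAG_structure_type",
                                          "name": "S"}}}}
type_table = {0x08: {"tag": "DW_TAG_structure_type", "name": "S"}}

found_via_unit = via_type_unit(cu_dies, type_units, 0x40)
found_via_table = via_type_table(type_table, 0x08)
```

Both routes end at the same type description; the table route needs one lookup where the type unit route needs three, one of them a search by signature.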

Yep, though to avoid the need for the proxy type you’d need to be able to refer to other entities in the “bag of DWARF”/generalized type unit (things like member function declarations and the like)

Yes, “bag of DWARF” or generalized type units (where you can refer to multiple entities in a single unit by some kind of hash) has some benefits.

But it seems somewhat orthogonal to your debug info linking goals here, unless it is a solution that removes the need for parsing the DWARF.

Another way to consider this would be to model (or actually implement) inter-DIE references as relocations (DW_FORM_sec_offset instead of a cu offset) - ah, I mentioned that earlier (I’m writing this reply out of order).

Hi Alexey,

Thank you for sharing this proposal. Reducing the size of debug info is generally a good thing, and I believe you’d see more debug info size reduction in Rust programs than in C++ programs, because I heard that the Rust compiler driver passes a lot of object files to the linker, expecting that the linker would remove most of them, which leaves dead debug info.

Debuginfo and linker folks, we (AccessSoftek) would like to suggest a proposal for removing obsolete debug info. If you find it useful we will be happy to work on improving it. Thank you for any opinions and suggestions.

Alexey.

Currently when the linker does garbage collection a lot of abandoned debug info is left behind (see Appendix A for documentation). Besides inflated debug info size, we ended up with overlapping address ranges and no way to say valid vs garbage ranges. We propose removing debug info along with removing code. This would reduce debug info size and make sure debug info accuracy.

There are several approaches which could be used to solve that problem:

  1. Require dwarf producers to generate fragmented debug data according to DWARF5 specification: “E.3.3 Single-function-per-DWARF-compilation-unit” page 388. That approach assumes fragmenting the whole debug info per function basis and glue fragmented sections at the link time using section groups.

  2. Use an additional tool, which would optimize out unnecessary debug data, something similar to dwz (dwarf compressor tool), dsymutil (links the DWARF debug information). This approach assumes additional post-link binaries processing.

  3. Teach the linker to parse debug data and let it remove unused debug data.

In this proposal, we focus on approach #3. We show that this approach is viable and discuss some preliminary results, leaving particular implementation out of the scope. We attach the Proof of Concept (PoC) implementation(https://reviews.llvm.org/D67469) for illustrative purposes. Please keep in mind that it is not final, and there is room for improvements (see Appendix B). However, the achieved results look quite promising and demonstrate up to 2 times size reduction and performance overhead is 30% of linking time (which is in the same ballpark as the already done section compressing (see table 2 point F)).

I believe #1 was added to DWARF5 to make link-time debug info GC possible, so could you tell me a little bit about why you chose to do #3? Is this because you want to do this for DWARF4?

24.09.2019 3:05, David Blaikie wrote:

    19.09.2019 4:24, David Blaikie wrote:

            1. Minimize or entirely avoid references from
            subprograms into other parts of .debug_info section.
            That would simplify splitting and removing subprograms
            out in that sense that it would minimize the number of
            references that should be parsed and followed.
            (DW_FORM_ref_subroutine instead of DW_FORM_ref_*, ?)

        Not sure I follow - by "other parts of the .debug_info
        section" do you mean in the same CU, or cross CU references?
        Any particular references you have in mind? Or encountered
        in practice?

        I mean here all kinds of references into .debug_info section.

    Ah, not only references from other places /into/ .debug_info
    (which don't really exist, so far as I know) but any references
    to locations within debug_info.

    Reducing these isn't super-viable - types being the most common
    examples. Though now I understand what you're getting at partly
    around the debug_type_table idea - adding a level of indirection
    to type references. So it'd be easy to find only one place to fix
    when removing chunks of debug_info (updating only the type table
    without having to find all the places inside debug_info to
    touch). That indirection would come at a size cost, of course -
    and an overhead for DWARF parsers having to follow that
    indirection. Doesn't make it impossible - just tradeoffs to be
    aware of.

    Though that's not the only DIE references - without removing them
    all there'd still be a fair bit of overhead for finding any
    remaining ones and applying them. If an indirection table is to
    be added, maybe a generalized one (for any DIE reference) rather
    than one only for types would be good.

    yes, some general indirection table would probably be useful.
    But, types would still require specialized handling.
    Types have "type hash" and need some specific logic around that.

This indirection is essentially the same as relocations & could be implemented that way (though no matter the solution you'd need some attribute on the CU that says "I don't use any CU-local DIE offsets" so an implementation didn't have to go searching/scanning for such offsets (though I guess it'd be cheap to scan for that by just looking at the abbreviations & if you don't see any CU-local DIE offset forms, use the fast-path)). A custom DWARF format would be potentially more compact than general ELF relocations.

I see, so an indirection table (or just relocations) would speed up reference patching. It would not be necessary to parse DWARF to find all the references that need to be corrected. They would already be gathered in the "indirection table" (or relocation table), and as a result the patching process would run faster.

But that solution has a cost, which you've already mentioned: the size of the debug info would increase.

My original suggestion was to evaluate the variant with a minimal size of debug info first.
"Types table"/"bag of DWARF" allows us to have minimal size by deduplicating base/proxy types and avoiding fragmentation.
If the performance turns out to be insufficient, we could then speed it up.
An indirection table is one option that would allow that speedup.
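
A minimal sketch of why such a table makes patching cheap, under the assumption that DIE references are stored as slot indices into the table rather than as offsets. The table layout and the function name are hypothetical; no such DWARF section exists today.

```python
def patch_indirection_table(table, removed):
    """Rewrite table slots after byte ranges are cut from .debug_info.

    table: list of DIE offsets, one per slot (hypothetical layout).
    removed: sorted, non-overlapping (start, end) ranges, end exclusive.
    Slots whose DIE was removed become None.
    """
    patched = []
    for offset in table:
        if any(start <= offset < end for start, end in removed):
            patched.append(None)
            continue
        dropped = sum(end - start for start, end in removed if end <= offset)
        patched.append(offset - dropped)
    return patched


# References inside the DIEs hold slot numbers (0, 1, 2), so their bytes stay
# untouched; only the table is rewritten.
old_table = [0x0B, 0x2A, 0x56]
new_table = patch_indirection_table(old_table, [(0x2A, 0x40)])
```

The price is the one already named in the thread: every slot adds bytes to the output, and every consumer has to follow one more hop per reference.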

            2. Create additional section - global types table
            (.debug_types_table). That would significantly reduce
            the number of references inside .debug_info section. It
            also makes it possible to have a 4-byte reference in
            this section instead of 8-bytes reference into type unit
            (DW_FORM_ref_types instead of DW_FORM_ref_sig8). It also
            makes it possible to place base types into this section
            and avoid per-compile unit duplication of them.
            Additionally, there could be achieved size reduction by
            not generating type unit header. Note, that new section
            - .debug_types_table - differs from DWARF4 section
            .debug_types in that sense that: it contains unique type
            descriptors referenced by offsets instead of list of
            type units referenced by DW_FORM_ref_sig8; all table
            entries share the same abbreviations and do not have
            type unit headers.

        What do you mean when you say "global types table" the
        phrasing in the above paragraph is present-tense, as though
        this thing exists but doesn't seem to describe what it
        actually is and how it achieves the things the text says it
        achieves. Perhaps I've missed some context here.

        The "global types table" does not exist yet. It could be
        created if the discussed approach would be considered useful.

    Ah, the present-tense language was a bit confusing for me when
    discussing a thing that doesn't exist yet & not having provided a
    description of what it might be or might contain and why it would
    exist/what it would achieve.

    I should've written it more precisely.

        Please check the comparison of possible "global types table"
        and currently existed type units: https://reviews.llvm.org/P8164

    Ah, that proposed version makes it easy to remove subprograms
    from debug_info without having to fix up type references (but you
    still have to have the code to fix up other cross-CU references,
    like abstract_origin, so I'm not sure it provides that much
    value) but doesn't make it easy to remove types (because you'd
    have to go looking through the debug_info section to update all
    the type offsets (which I guess you have to do anyway to find the
    type references) and removing the types still also requires
    fixing up the types that reference each other...

    So I'm not seeing a big win there.

    Correct. Even if types were put into a separate table, it would
    still be necessary to:
     "go looking through the debug_info section to update all the type
    offsets";
     "removing the types still also requires fixing up the types that
    reference each other".

     But additionally it provides the following benefits:

     1. Size reduction by removing fragmentation. In the
    "-fdebug-types-section" solution every type which is put into a
    type unit requires:
     - an additional type unit header,
     - a section header (since it is put into a separate section),
     - proxy type copies inside the compilation unit.

     Putting types into a separate table avoids creating the above
    data for every type.

    2. Size reduction by deduplicating base types. In the
    "-fdebug-types-section" solution base types are not deduplicated
    at all.

Base types are pretty small - not sure there'd be much to save by indirection (for classic base types like "int" - for non-trivial but non-user-defined types like subroutine types there might be more opportunity for savings). & you'd still have some cost of indirection to tradeoff - so I don't think it's always going to be the right solution to indirect everything.

There's a lot of design considerations in this problem space, let's put it that way.

For the clang binary, they (base/proxy types) take ~1.5% of overall .debug_info + .debug_types. Fragmentation from #1 takes another ~1.5%.

Implementing "bag of DWARF"/"generalized type unit"/"types table" allows deduplicating base/proxy types and avoiding fragmentation.
That could free approx. 3% of debug info, which could be used either for reducing debug info size or for creating an "indirection table" accelerator.

My idea is to start from the minimum size of debug info and to check whether parsing performance would be sufficient. The PoC implementation for this proposal does all of these things: it parses abbreviations, removes parts of debug_info, searches for references which should be patched, and patches the references. Its performance looks quite good.
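
The "removes parts of debug_info" step can be pictured on a toy DIE tree. This is not the PoC code from D67469: the dictionary-based DIEs are hypothetical, and the is_live predicate stands in for the section liveness information the linker already has after garbage collection.

```python
def prune_dead_subprograms(die, is_live):
    """Return a copy of the DIE tree without subprograms for dead code."""
    if (die["tag"] == "DW_TAG_subprogram" and "low_pc" in die
            and not is_live(die["low_pc"])):
        return None  # the function's section was discarded by the linker
    children = [prune_dead_subprograms(child, is_live)
                for child in die.get("children", [])]
    return {**die, "children": [c for c in children if c is not None]}


cu = {"tag": "DW_TAG_compile_unit", "name": "a.cpp", "children": [
    {"tag": "DW_TAG_subprogram", "name": "used", "low_pc": 0x1000},
    {"tag": "DW_TAG_subprogram", "name": "unused", "low_pc": 0x2000},
    # Abstract origin of an inlined function: no address, so it is kept.
    {"tag": "DW_TAG_subprogram", "name": "f1"},
]}

live_cu = prune_dead_subprograms(cu, lambda addr: addr == 0x1000)
kept_names = [child["name"] for child in live_cu["children"]]
```

After pruning, the remaining references still have to be found and patched, which is the part the indirection table discussion above is about.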

    3. Performance improvement by handling less data. #1 leads to
    loading and parsing fewer bits.

    4. Performance improvement by handling fewer references. Simpler
    reference chains allow parsing references faster.
     Instead of this:

    type_offset->proxy_type->DW_FORM_ref_sig8->type_unit->type_offset->type.

     There would be this :

     type_offset->type_table->type.

Yep, though to avoid the need for the proxy type you'd need to be able to refer to other entities in the "bag of DWARF"/generalized type unit (things like member function declarations and the like)

Yes, "bag of DWARF" or generalized type units (where you can refer to multiple entities in a single unit by some kind of hash) has some benefits.

But it seems somewhat orthogonal to your debug info linking goals here, unless it is a solution that removes the need for parsing the DWARF.

Minimizing debug_info size is also a goal, if the performance of parsing DWARF is acceptable.

Another way to consider this would be to model (or actually implement) inter-DIE references as relocations (DW_FORM_sec_offset instead of a cu offset) - ah, I mentioned that earlier (I'm writing this reply out of order).

Agreed.

Alexey

24.09.2019 8:26, Rui Ueyama wrote:

Hi Alexey,

Thank you for sharing this proposal. Reducing the size of debug info is generally a good thing, and I believe you'd see more debug info size reduction in Rust programs than in C++ programs, because I heard that the Rust compiler driver passes a lot of object files to the linker, expecting that the linker would remove most of them, which leaves dead debug info.

Hi Rui, Thanks!

    Debuginfo and linker folks, we (AccessSoftek) would like to
    suggest a proposal for removing obsolete debug info. If you find
    it useful we will be happy to work on improving it. Thank you for
    any opinions and suggestions.

    Alexey.

     Currently when the linker does garbage collection a lot of
    abandoned debug info is left behind (see Appendix A for
    documentation). Besides inflated debug info size, we ended up with
    overlapping address ranges and no way to say valid vs garbage
    ranges. We propose removing debug info along with removing code.
    This would reduce debug info size and make sure debug info accuracy.

    There are several approaches which could be used to solve that
    problem:

    1. Require dwarf producers to generate fragmented debug data
    according to DWARF5 specification: "E.3.3
    Single-function-per-DWARF-compilation-unit" page 388. That
    approach assumes fragmenting the whole debug info per function
    basis and glue fragmented sections at the link time using section
    groups.

    2. Use an additional tool, which would optimize out unnecessary
    debug data, something similar to dwz (dwarf compressor tool),
    dsymutil (links the DWARF debug information). This approach
    assumes additional post-link binaries processing.

    3. Teach the linker to parse debug data and let it remove unused
    debug data.

    In this proposal, we focus on approach #3. We show that this
    approach is viable and discuss some preliminary results, leaving
    particular implementation out of the scope. We attach the Proof of
    Concept (PoC) implementation(https://reviews.llvm.org/D67469) for
    illustrative purposes. Please keep in mind that it is not final,
    and there is room for improvements (see Appendix B). However, the
    achieved results look quite promising and demonstrate up to 2
    times size reduction and performance overhead is 30% of linking
    time (which is in the same ballpark as the already done section
    compressing (see table 2 point F)).

I believe #1 was added to DWARF5 to make link-time debug info GC possible, so could you tell me a little bit about why you chose to do #3? Is this because you want to do this for DWARF4?

No, the proposal is not DWARF-4 specific; it applies to DWARF-5 as well. The solution added to DWARF-5 ("E.3.3 Single-function-per-DWARF-compilation-unit", page 388) is not a complete solution. It is a recommendation that still needs an additional specification.
The -fdebug-types-section implementation follows that recommendation. Cases other than type units do not fit into it easily. Some tables have a common header, e.g. .debug_line, .debug_rnglists, .debug_addr. It is not clear how these tables could be split between section groups.

The more important problem is the fragmentation itself. Dividing debug tables into pieces would increase debug info size.
It would also significantly complicate the code that works with debug info. E.g. include/llvm/DebugInfo/DWARF/DWARFObject.h declares the interface of class DWARFObject, which is currently not ready for the case where there are multiple tables. A patch introducing support for multiple tables would be a massive change affecting many places in the llvm codebase.

Another point is that not only the llvm code base but all other DWARF consumers would have to be changed to support fragmented debug info.

In short, if all debug tables were fragmented, working with debug info would become significantly more complicated.

Thus the reasons to select #3 are:

1. It could be done in a single place, without affecting other parts of the llvm code base.
2. It does not require other DWARF consumers to implement support for it.
3. Avoiding fragmentation would save space.
4. Processing non-fragmented debug info is faster.
5. There is no need to adapt DWARF tables for fragmentation. They could be handled in their current state.
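The space argument in point 3 can be illustrated with a toy size model. This is only a sketch under stated assumptions: it assumes that a fragmented table has to repeat the common header in every per-function fragment, and all byte sizes below are hypothetical numbers rather than measurements. The function names `size_shared` and `size_fragmented` are invented for this example.

```python
def size_shared(header_size, sequence_sizes, live):
    """One table: a single common header followed by the live sequences only."""
    return header_size + sum(s for s, keep in zip(sequence_sizes, live) if keep)


def size_fragmented(header_size, sequence_sizes, live):
    """One fragment per function: every live fragment carries its own header copy."""
    return sum(header_size + s for s, keep in zip(sequence_sizes, live) if keep)


# Hypothetical table: a 100-byte header and three per-function sequences,
# the second of which belongs to a garbage-collected function.
header = 100
sequences = [40, 60, 20]
live = [True, False, True]

shared = size_shared(header, sequences, live)
fragmented = size_fragmented(header, sequences, live)
```

Under this assumption the fragmented layout pays for the header once per surviving function, so the gap grows with the number of functions.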

Alexey

Alexey,

Thank you for the detailed explanation. The other question I have is, as discussed above, about dsymutil. You said that dsymutil is not usable at link-time. What does that mean? If we only have to emit an output file in the usual way and then automatically invoke dsymutil on the file that the linker just created, that’s easy to do, and lld and dsymutil can live in the same process so that you can keep the linker from depending on an external command.

dsymutil isn’t really (to my knowledge) set up for that sort of operation at the moment - it’s currently very tied to the Apple/OSX/MachO debug info distribution model (it’s for creating dsym debug info bundles from a set of object files and an output of addresses from the linker).

If it was generalized as a post-processing step, that would be good for archival purposes (reducing the size of debug info in binaries in the long-term) but wouldn’t address what are probably the more significant drawbacks for some users (including Google) - the sheer number of bytes copied from input to output during linking - reducing the amount of linker output written in the first place would be significantly beneficial. (though I do think/hope dsymutil’s implementation could be adapted/generalized to be used in this situation - and I do have concerns that doing such non-trivial work at link time might not be a great tradeoff because the complexity and memory usage might be more than the savings, though I’ve no certainty one way or the other there)

25.09.2019 9:21, Rui Ueyama wrote:

Alexey,

Thank you for the detailed explanation. The other question I have is, as discussed above, about dsymutil. You said that dsymutil is not usable at link-time. What does that mean? If we only have to emit an output file in the usual way and then automatically invoke dsymutil on the file that the linker just created, that's easy to do, and lld and dsymutil can live in the same process so that you can keep the linker from depending on an external command.

Calling dsymutil as a separate invocation would have significant overhead, because the linker would create an unoptimized binary and dsymutil would then load it again, parse it, and write the optimized info out. Thus, there would be additional disk space usage and a performance loss. When I said that dsymutil is not usable at link-time, I meant the case of tighter integration. At the moment it is not possible to reuse the implementation of dsymutil inside the linker. It is not possible to say (somewhere after markLive<ELFT>()):

DwarfLinker.link(DebugMap) // DwarfLinker is a dsymutil class

and have the benefit of already loaded object files, an already created DebugMap, and debug info sections stripped by DwarfLinker.link(), which lld would later write out.

The presented PoC implementation, in contrast, benefits from already loaded object files, already computed liveness information, and a reduced output binary. This allows it to achieve minimal disk usage and maximal execution speed.
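The general idea of reusing liveness information can be sketched as follows. This is a simplified illustration and not the PoC code: `DebugFragment` and `copy_live_debug` are hypothetical names, and the sketch assumes that each piece of debug data can be attributed to exactly one code section.

```python
from dataclasses import dataclass


@dataclass
class DebugFragment:
    name: str          # e.g. the subprogram this debug data describes
    code_section: str  # input section that holds the described code
    data: bytes        # raw debug bytes for this fragment


def copy_live_debug(fragments, live_sections):
    """Copy only the fragments whose code section survived garbage collection."""
    out = bytearray()
    for fragment in fragments:
        if fragment.code_section in live_sections:
            out += fragment.data
    return bytes(out)


# Hypothetical input: ".text.unused" was removed by garbage collection.
fragments = [
    DebugFragment("main", ".text.main", b"AAAA"),
    DebugFragment("unused", ".text.unused", b"BBBBBB"),
    DebugFragment("helper", ".text.helper", b"CC"),
]
live_sections = {".text.main", ".text.helper"}
output = copy_live_debug(fragments, live_sections)
```

Because the dead fragment is never written, the output is produced in a single pass, with no unoptimized intermediate file that a later tool would have to read back.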

25.09.2019 18:49, David Blaikie wrote:

    Alexey,

    Thank you for the detailed explanation. The other question I have
    is, as discussed above, about dsymutil. You said that dsymutil is
    not usable at link-time. What does that mean? If we only have to
    emit an output file in the usual way and then automatically invoke
    dsymutil on the file that the linker just created, that's easy to
    do, and lld and dsymutil can live in the same process so that you
    can keep the linker from depending on an external command.

dsymutil isn't really (to my knowledge) set up for that sort of operation at the moment - it's currently very tied to the Apple/OSX/MachO debug info distribution model (it's for creating dsym debug info bundles from a set of object files and an output of addresses from the linker).

If it was generalized as a post-processing step, that would be good for archival purposes (reducing the size of debug info in binaries in the long-term) but wouldn't address what are probably the more significant drawbacks for some users (including Google) - the sheer number of bytes copied from input to output during linking - reducing the amount of linker output written in the first place would be significantly beneficial.

I would like to note that the PoC implementation does exactly this: it reduces the number of bytes copied from input to output during linking, and it reduces the amount of linker output.

Additionally, I measured the memory usage of the PoC implementation. The following table shows memory usage for linking clang:

Alexey,

I’m a bit worried about teaching lld about DWARF, as this is something we’ve been careful to avoid. Linkers are mostly agnostic about the contents of sections. Sections are basically just bags of bytes, and linkers generally don’t attempt to parse their contents. That being said, we’ve already taught lld how to parse (some part of) DWARF to implement --gdb-index and other features, and because of the nature of the DWARF file format it is unavoidable. So it may be OK to add more code for DWARF dedup, if the additional complexity is not too much, and the new code is nicely isolated from existing code. I think I agree with you that the linker is perhaps the best place to drop dead DWARF info. Let me start code review to see how the code works. Thanks!

24.09.2019 3:05, David Blaikie wrote:

    19.09.2019 4:24, David Blaikie wrote:

        Generally speaking, dsymutil does a very similar thing. It
        parses DWARF DIEs, analyzes relocations, scans through
        references and throws out unused DIEs. But its current
        interface does not allow using it at the link stage.
        I think it would be perfect to have a single implementation.
        Though I did not analyze how easy, or whether it is possible,
        to reuse its code at the link stage, it looked like it needs a
        significant rework.

        The implementation from this proposal removes obsolete
        debug info at the link stage, and so has the benefits of
        already loaded object files, already created liveness
        information, and generating an optimized binary from scratch.

        If dsymutil could be refactored in such a manner that it could
        be used at the link stage, then its implementation could be
        reused. I would research the possibility of such a refactoring.

    Yeah, if this is going to be implemented, I think that would be
    strongly preferred - though I realize it may be substantial work
    to refactor. The alternative - duplicating all this work -
    doesn't seem like something that would be good for the LLVM project.

    I see. So I would research the question of whether it is possible
    to refactor it accordingly.

It looks like I have a prototype of that refactoring. Next I am going to create a patch from it.

Thank you, Alexey.

27.09.2019 11:46, Rui Ueyama wrote:

Alexey,

I'm a bit worried about teaching lld about DWARF, as this is something we've been careful to avoid. Linkers are mostly agnostic about the contents of sections. Sections are basically just bags of bytes, and linkers generally don't attempt to parse their contents. That being said, we've already taught lld how to parse (some part of) DWARF to implement --gdb-index and other features, and because of the nature of the DWARF file format it is unavoidable. So it may be OK to add more code for DWARF dedup, if the additional complexity is not too much, and the new code is nicely isolated from existing code. I think I agree with you that the linker is perhaps the best place to drop dead DWARF info. Let me start code review to see how the code works. Thanks!

OK, thank you. I have started refactoring dsymutil so that it can be used inside the linker.
After that I am going to start working on the linker part of this.

Thank you, Alexey.