Abstract:
This RFC proposes adding a new DWARF attribute, DW_AT_LLVM_stmt_sequence, for each function. The attribute addresses the loss of line table data for functions that have been merged by optimizations such as linker ICF (Identical Code Folding). It ensures that line table entries can always be linked unambiguously to their respective functions.
Background:
Relevant Pull request: [DebugInfo] Add flag to enable function-level debug line attribution
In collaboration with Greg Clayton, while integrating merged function information into GSYM, we encountered cases where line table data was lost in the .dSYM because of how DWARF currently represents it. Typically, each compile unit has its own line table in a .o file before linking. After optimizations such as linker ICF, functions within a compile unit may be merged, so multiple functions end up at the same address. When relocations are then applied to the line table, the result can be overlapping sequences that describe the same address range. This overlap complicates symbolication, because a DWARF consumer may retrieve a line entry from the wrong sequence. In particular, when translating DWARF to GSYM, querying the LLVM line table for rows matching an address can return incorrect data, as the query always returns the first matching sequence for that address.
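The first-match behaviour described above can be illustrated with a small sketch. The model is hypothetical: `Row`, `Sequence`, and `lookup_first_match` are illustrative names rather than LLVM APIs, and the two overlapping sequences are made-up data standing in for two merged functions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    address: int
    file: str
    line: int


@dataclass
class Sequence:
    low_pc: int      # first address covered by the sequence
    high_pc: int     # address of end_sequence (exclusive)
    rows: List[Row]


def lookup_first_match(sequences: List[Sequence], address: int) -> Optional[Row]:
    """Return the row covering `address` from the first sequence that contains it."""
    for seq in sequences:
        if seq.low_pc <= address < seq.high_pc:
            match = None
            for row in seq.rows:  # rows are sorted by address within a sequence
                if row.address <= address:
                    match = row
            return match
    return None


# Hypothetical data: two functions merged by ICF, both at [0x1000, 0x1020).
foo_seq = Sequence(0x1000, 0x1020, [Row(0x1000, "foo.h", 20), Row(0x1010, "foo.h", 21)])
bar_seq = Sequence(0x1000, 0x1020, [Row(0x1000, "bar.h", 44), Row(0x1010, "bar.h", 45)])

row = lookup_first_match([foo_seq, bar_seq], 0x1010)
print(row.file, row.line)  # foo.h 21, even if the caller is symbolicating "bar"
```

Which function's rows come back depends only on the order of the sequences, not on the function being symbolicated. This is the ambiguity the proposal is meant to remove.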
Proposal:
To ensure accurate retrieval of line table data for each function in DWARF, we propose adding a new attribute, DW_AT_LLVM_stmt_sequence. This attribute would store the offset within the line table of the start of a sequence that contains line entries solely for this function. For instance, if two inlined functions are merged, this attribute would help pinpoint the correct line table entries for each function.
Consider the following line table structure:
.debug_line:
0x00000000: prologue
0x00000030: sequence[0] for function "foo"
    0x1000: foo.h:20
    0x1010: foo.h:21
    0x1020: foo.h:21 end_sequence
0x00000080: sequence[1] for function "bar"
    0x1000: bar.h:44
    0x1010: bar.h:45
    0x1020: bar.h:45 end_sequence
Here, both sequences describe the range [0x1000, 0x1020). By using the DW_AT_LLVM_stmt_sequence attribute, we can uniquely associate each sequence with its function:
0x00000032: DW_TAG_subprogram
    DW_AT_low_pc (0x1000)
    DW_AT_high_pc (0x1020)
    DW_AT_name ("foo")
    DW_AT_LLVM_stmt_sequence (0x00000030)
0x00000094: DW_TAG_subprogram
    DW_AT_low_pc (0x1000)
    DW_AT_high_pc (0x1020)
    DW_AT_name ("bar")
    DW_AT_LLVM_stmt_sequence (0x00000080)
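As a rough sketch of how a consumer might use the attribute, the lookup below goes from a function's DIE to its sequence offset and then searches only that sequence. The dictionaries and `lookup_line` are hypothetical stand-ins for already-parsed DWARF, not an existing LLVM or GSYM API. A real consumer would decode the line program starting at the given offset rather than index a pre-built table.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pre-parsed data mirroring the example above.
# Key: offset of the sequence within .debug_line; value: (address, file, line) rows.
SEQUENCES: Dict[int, List[Tuple[int, str, int]]] = {
    0x30: [(0x1000, "foo.h", 20), (0x1010, "foo.h", 21)],
    0x80: [(0x1000, "bar.h", 44), (0x1010, "bar.h", 45)],
}

# Subprogram DIEs reduced to the attributes used here.
FUNCTIONS: Dict[str, Dict[str, int]] = {
    "foo": {"low_pc": 0x1000, "high_pc": 0x1020, "stmt_sequence": 0x30},
    "bar": {"low_pc": 0x1000, "high_pc": 0x1020, "stmt_sequence": 0x80},
}


def lookup_line(function_name: str, address: int) -> Optional[Tuple[str, int]]:
    """Find the (file, line) for `address` using only the function's own sequence."""
    die = FUNCTIONS[function_name]
    if not die["low_pc"] <= address < die["high_pc"]:
        return None
    match = None
    for row_address, file, line in SEQUENCES[die["stmt_sequence"]]:
        if row_address <= address:
            match = (file, line)
    return match


print(lookup_line("foo", 0x1010))  # ('foo.h', 21)
print(lookup_line("bar", 0x1010))  # ('bar.h', 45)
```

Both functions cover the same address, yet each query resolves to its own rows because the sequence is selected by offset rather than by address.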
Advantages:
- Speeds up symbolication of individual addresses: a consumer can parse only the sequence at the given offset instead of the entire line table.
- Accurately retrieves line tables for merged functions.
- Maintains backward compatibility: consumers that do not recognize DW_AT_LLVM_stmt_sequence can ignore it, and the existing DWARF format is unchanged.
- Does not require modifications to the .debug_line encoding for ICF/merged functions.
Disadvantages:
- Each function would require its own sequence, whereas currently, consecutive functions often share a sequence.
- More relocations will be generated in the .debug_line section, because each function now requires its own line sequence, and one of the first items in a sequence is an address that requires a relocation. Previously, many functions could share one line sequence, with no relocation for the start address of each function.
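The relocation cost can be estimated with a deliberately simplified model. It assumes one relocated start address per sequence and no relocations for address advances within a sequence, which may not hold on every target. `count_start_address_relocations` is a hypothetical helper, not part of any toolchain.

```python
from typing import List


def count_start_address_relocations(sequences: List[List[str]]) -> int:
    """Count one relocated start address per non-empty sequence (simplified model)."""
    return sum(1 for functions in sequences if functions)


functions = ["f1", "f2", "f3", "f4"]

shared = [functions]                           # one sequence covering all four functions
per_function = [[name] for name in functions]  # one sequence per function

print(count_start_address_relocations(shared))        # 1
print(count_start_address_relocations(per_function))  # 4
```

Under these assumptions the number of start-address relocations grows from one per shared sequence to one per function.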
Conclusion:
This proposal aims to ensure that GSYM can accurately identify the correct line table entries for merged functions. Symbolication using this new attribute would require additional logic in GSYM tooling, but that is outside the scope of this RFC. What are your thoughts on this proposal?