[RFC] New DWARF attribute for symbolication of merged functions

Abstract:
This RFC proposes adding a new DWARF attribute, DW_AT_LLVM_stmt_sequence, for each function. This attribute aims to resolve the loss of line table data in functions that have been merged during optimization processes like linker ICF. It will ensure that line table entries can always be distinctly linked to their respective functions.

Background:
Relevant Pull request: [DebugInfo] Add flag to enable function-level debug line attribution

In collaboration with Greg Clayton, while integrating merged function information into GSYM, we encountered instances where line table data was lost in the .dSYM due to the current representation in DWARF. Typically, each compile unit possesses a unique line table in a .o file before linking. After optimizations such as linker ICF, functions might be merged within a compile unit, leading to multiple functions overlapping at the same address. Consequently, when relocations are applied to the line table, the result can be overlapping sequences that describe the same address range. This overlap complicates symbolication, as DWARF consumers might retrieve a line entry from an incorrect sequence. When translating DWARF to GSYM, querying the LLVM line table for address-matching rows might return incorrect data, as it always fetches the first matching sequence for a particular address.

Proposal:
To ensure accurate retrieval of line table data for each function in DWARF, we propose adding a new attribute, DW_AT_LLVM_stmt_sequence. This attribute would store the offset within the line table of the start of a sequence that contains line entries solely for this function. For instance, if two inlined functions are merged, this attribute would help pinpoint the correct line table entries for each function.
Consider the following line table structure:

.debug_line:
0x00000000: prologue
0x00000030: sequence[0] for function "foo"
   0x1000: foo.h:20
   0x1010: foo.h:21
   0x1020: foo.h:21 end_sequence
0x00000080: sequence[1] for function "bar"
   0x1000: bar.h:44
   0x1010: bar.h:45
   0x1020: bar.h:45 end_sequence

Here, both sequences describe the range [0x1000-0x1020). By using the DW_AT_LLVM_stmt_sequence attribute, we can uniquely associate a line table with its function:

0x00000032:   DW_TAG_subprogram
            	DW_AT_low_pc (0x1000)
            	DW_AT_high_pc (0x1020)
            	DW_AT_name ("foo")
            	DW_AT_LLVM_stmt_sequence (0x00000030)
0x00000094:   DW_TAG_subprogram
            	DW_AT_low_pc (0x1000)
            	DW_AT_high_pc (0x1020)
            	DW_AT_name ("bar")
            	DW_AT_LLVM_stmt_sequence (0x00000080)
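
To make the ambiguity concrete, here is a minimal Python sketch of the example above. It is only a toy model, not LLVM's line table API: the `Sequence` class and the two lookup helpers are hypothetical names made up for illustration. It contrasts an address-only lookup, which stops at the first matching sequence, with a lookup that starts from the offset stored in the attribute.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    offset: int  # offset of the sequence within .debug_line
    rows: list   # (address, file, line) tuples, sorted by address
    end: int     # address carried by the end_sequence row


# The two sequences from the example; both cover [0x1000, 0x1020).
LINE_TABLE = [
    Sequence(0x30, [(0x1000, "foo.h", 20), (0x1010, "foo.h", 21)], 0x1020),
    Sequence(0x80, [(0x1000, "bar.h", 44), (0x1010, "bar.h", 45)], 0x1020),
]


def row_in_sequence(seq, addr):
    """Return the last row at or below addr, or None if addr is outside seq."""
    if not seq.rows or not (seq.rows[0][0] <= addr < seq.end):
        return None
    match = None
    for row in seq.rows:
        if row[0] <= addr:
            match = row
    return match


def lookup_by_address(table, addr):
    """Address-only lookup: the first sequence containing addr wins."""
    for seq in table:
        row = row_in_sequence(seq, addr)
        if row is not None:
            return row
    return None


def lookup_by_stmt_sequence(table, stmt_sequence, addr):
    """Lookup restricted to the sequence that the attribute points at."""
    for seq in table:
        if seq.offset == stmt_sequence:
            return row_in_sequence(seq, addr)
    return None
```

In this model, `lookup_by_address(LINE_TABLE, 0x1010)` yields the `foo.h` row even when the caller is symbolicating "bar", while `lookup_by_stmt_sequence(LINE_TABLE, 0x80, 0x1010)` yields the `bar.h` row.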

Advantages:

  • Significantly enhances the speed of symbolication for individual addresses by eliminating the need to parse the entire line table.
  • Accurately retrieves line tables for merged functions.
  • Maintains backward compatibility; consumers that do not recognize DW_AT_LLVM_stmt_sequence can ignore it, and the existing DWARF format is unchanged.
  • Does not require modifications to the .debug_line encoding for ICF/merged functions.

Disadvantages:

  • Each function would require its own sequence, whereas currently, consecutive functions often share a sequence.
  • More relocations will be generated in the .debug_line section, because each function now requires its own line sequence, and one of the first items in a sequence is an address that requires relocation. Previously, many functions might share one line sequence, with no relocation for the start address of each function.

Conclusion:
This proposal aims to ensure that GSYM can accurately identify the correct line table entries for merged functions. Symbolication using this new attribute would require additional logic in GSYM tooling – but that is outside of this RFC. What are your thoughts on this proposal?

I’m unclear how this speeds up symbolication (translating an address to a symbol, or perhaps symbol+line, right?) and avoids parsing the entire line table. I understand that you don’t want to get into the deep details of GSYM but at least a high-level view of how the algorithm changes would be helpful.

That part is referring to symbolication using dSYM - e.g. llvm-addr2line
The idea is that when looking up the line information for an address, the DW_AT_META_stmt_sequence offset can be used to parse only the line entries for the function containing the address. Normally, the entire line table would have to be parsed until the address in question is found. This is a possible improvement to the lookup algorithm - adding this attribute won’t automatically improve dSYM lookup speed. I am not sure how lookup is currently performed, but the current line table format would require parsing the whole line table until the address in question is found.

Symbolication speed for GSYM would not be improved by this feature. For GSYM the only benefit would be the ability to correctly attribute line information in the case of merged functions.

(please update the post/pull requests/etc to cross-link to previous discussions and pull requests - it’s difficult to follow issues when they end up being discussed multiple times separately)

I think/guess we discussed this somewhere before, but might’ve been an offline thread, so I guess this’ll rehash some of the same questions…

How does having a link from subprogram to line table improve the situation? If you ICF two functions together, it’s still going to be confusing to the user, right - you’ll risk ending up describing function A when they were really in function B (I recall the most common case of this on Windows/MSVC which defaults to aggressive ICF under optimizations, was that a lot of dtors got folded together - so you’d end up looking at backtraces which mention dtors for types for which there are no variables of that type in the scope)

But this does mean that at least the file/line number and the function name will match - so it’ll only be one level of confusion, rather than two. Is that the value you mean/are aiming for here?

Or are you also doing things like call_site based disambiguation to try to get the right description from context despite the ambiguous merged function?

Oh, and in terms of standardizing something - I’d suggest we’d want some lookup table in the line table, to avoid adding more relocations.

Oh, I guess we could avoid linker-relocations without needing to modify the line table: If these offsets are relative to the start of the line table, that’d probably be good.

Anyone that actually uses DWARF to symbolicate something, such as atos or llvm-symbolizer, has to parse the entire line table for the source file once they find a DW_TAG_subprogram in the DWARF that contains the address in question. With this change, you will be able to skip all other functions because each function’s line entries are in their own sequence. So the old flow is:

- use accelerator table to find CU that contains address, or scan the CU range(s) to see if a PC is in a particular CU
- find the DW_TAG_subprogram that contains the PC
- parse entire line table for the source file and find the match you care about

The new flow would be similar, but the last stage you would parse only the line table sequence you care about for the function you need.
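
As a rough illustration of the difference between the two flows, here is a small Python sketch over a toy in-memory model. The data layout and the function names (`old_flow`, `new_flow`) are assumptions made up for this example and do not correspond to any real tool; the sketch only counts how many line rows each flow has to decode before it finds the answer.

```python
# Toy line table: (sequence offset, [(address, line), ...], end address).
SEQUENCES = [
    (0x30, [(0x1000, 10), (0x1010, 11)], 0x1020),
    (0x60, [(0x2000, 20), (0x2010, 21)], 0x2020),
    (0x90, [(0x3000, 30), (0x3010, 31)], 0x3020),
]

# Toy subprogram DIEs: (name, low_pc, high_pc, stmt_sequence offset).
SUBPROGRAMS = [
    ("a", 0x1000, 0x1020, 0x30),
    ("b", 0x2000, 0x2020, 0x60),
    ("c", 0x3000, 0x3020, 0x90),
]


def find_subprogram(pc):
    """Find the subprogram whose [low_pc, high_pc) range contains pc."""
    for die in SUBPROGRAMS:
        if die[1] <= pc < die[2]:
            return die
    return None


def best_row(rows, end, pc):
    """Return the last row at or below pc if the sequence covers pc."""
    if not rows or not (rows[0][0] <= pc < end):
        return None
    return max((r for r in rows if r[0] <= pc), key=lambda r: r[0])


def old_flow(pc):
    """Walk sequences from the start of the table until one matches."""
    die = find_subprogram(pc)
    if die is None:
        return None
    decoded = 0
    for _, rows, end in SEQUENCES:
        decoded += len(rows) + 1  # +1 for the end_sequence row
        row = best_row(rows, end, pc)
        if row is not None:
            return die[0], row[1], decoded
    return None


def new_flow(pc):
    """Jump straight to the sequence named by the subprogram's attribute."""
    die = find_subprogram(pc)
    if die is None:
        return None
    for offset, rows, end in SEQUENCES:
        if offset == die[3]:
            row = best_row(rows, end, pc)
            if row is not None:
                return die[0], row[1], len(rows) + 1
    return None
```

Both flows return the same function name and line for a given PC in this model; only the third element, the number of rows decoded, differs.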

This solution is only trying to make it possible to get the right information for the function we find in the DWARF. If we have two functions that were merged, then right now we always find the first line table entries for that range and then attribute them to both “foo” and “bar”…

And tools that parse the DWARF and convert it to another format, like GSYM, are able to get the line tables right for each function.

Right now GSYM will pick one of the functions and omit any others when the ranges are exactly the same, but we want to change this so that GSYM saves all of the ranges and we hope to add additional data that says if called by "func1", then use "foo" or if called by "func2", then use "bar".
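
The caller-based selection described here could look something like the following Python sketch. This is purely hypothetical: the `MERGED_FUNCTIONS` table and `resolve_merged` helper are invented for illustration and say nothing about how GSYM would actually encode or query this data.

```python
# Hypothetical table: merged address -> (default name, {caller: callee name}).
MERGED_FUNCTIONS = {
    0x1000: ("foo", {"func1": "foo", "func2": "bar"}),
}


def resolve_merged(addr, caller=None):
    """Pick the function name for a merged address, using the caller if known."""
    entry = MERGED_FUNCTIONS.get(addr)
    if entry is None:
        return None
    default_name, by_caller = entry
    # Fall back to an arbitrary default when the caller is unknown,
    # which mirrors today's "pick one of the functions" behaviour.
    return by_caller.get(caller, default_name)
```

With a stack frame whose caller is "func2", `resolve_merged(0x1000, "func2")` selects "bar"; without caller information it falls back to "foo".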

But this change is solely to allow anyone parsing DWARF to be able to get the right line sequences for the right function, because right now there is no way to do the right thing.

Can DWARF 5 line tables grab addresses from the .debug_addr section? This could help avoid relocations…

No, it can’t. Would be good to add at some point, but isn’t supported at the moment.

Oh, another question. You mention dsyms and that this is intended for macho somewhere, I think?

But macho doesn’t have the issue with needing to start more sequences, does it? It already assumes/only has function-sections-like behavior, one sequence per function I thought?

In any case I think the way I’d spec this is that the new attribute would point to the start of a sequence that includes the subprogram, but isn’t guaranteed to be the only thing in that sequence. Then you could still use it for non-function-sectioned code.

Though if you only put the attribute on functions in function sections, that’s probably fine - right? Functions that aren’t in function sections can’t be ICF’d because the linker only takes whole sections at a time (on ELF; on Mach-O, you can think of subsections (subsections via symbols) as being equivalent to ELF sections)

.debug_line deliberately does not depend on other sections (other than .debug_line_str) so that all other sections can be stripped and the line table will still be usable.

Right - so could go the other way, could put the address pool into the line table, to share it there. But, generally, it would be nice to be able to share relocations between the line table and debug_info.

Though, equally, now that I think about it - that’s pretty orthogonal to this issue. Even if they could be shared, that’s sharing .text relocations, not helpful for sharing cross-debug_* relocations that are the thing at issue here.

If we go with a solution where the line table points to .debug_info, then yeah that would be the case. I think clayborg was thinking about how to avoid extra relocations from .debug_line to .text because of the extra per-function set_address opcodes. If set_address had an index+offset form, indexing into a .debug_addr-like table in .debug_line, that could work.

(In cases where you used -ffunction-sections it wouldn’t gain you anything though, you’d still need one relocation per set_address regardless.)

It seems like one justification (doing this all faster) assumes that parsing one CU’s worth of .debug_line is expensive. Is there data to support that assumption? I know it takes a bunch of code but it doesn’t seem inherently complicated, algorithmically.

The argument about attributing correct source locations in the face of ICF, when there are multiple attributions, does seem reasonable. Today, it’s possible to find all the DW_TAG_subprograms that were folded together to contain a specific address, but there is no way to find the source attributions for anything other than the lucky canonical copy.

Ah, yeah, I think it’s a bit of a red herring - I don’t think there’s a need to put this attribute on functions that aren’t already in their own section anyway? (if you were doing it for parsing performance, maybe - though I share your doubt that that’s a major/real issue, wouldn’t mind seeing some data on that)

Not sure I follow - it seems equally possible to find all the subprograms and all the line table fragments that describe the ICF’d code. What you can’t do currently is know which one goes with which one - which, the only reason I can think of you’d need that is slight improvement in symbolizing quality (you get an arbitrary function, but at least the function name and line numbers/filenames all agree with each other, even if they may not agree with the code) or maybe a slightly bigger improvement if you’re using call_site-based disambiguation.

OK, so going back over the offline thread that happened before all this - seems like the goal is just the issue of mismatched line table / subprogram descriptions. I guess the performance issues are speculative at best (but if not, happy to see the data).

It’s not a complete solution to ICF, and the main problem (describing one function, when a different function was the one that was called) remains, but it does address the particular weirdness of having mismatched file/line vs. function name/other details.

One could /probably/ get this right without the extra data for cross-CU folding (because you would only have one copy of the line table for that offset in a single CU) - but two functions within the same CU can be folded together, and without this extra debug info there wouldn’t be a way to associate those correctly, I don’t think… - could potentially have the compiler do that ICF at compile-time, oh, but it’s probably iterative at link time (if two otherwise-identical functions call two actually identical functions, once the latter get merged then the former can be merged).
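
The iterative nature of link-time folding mentioned above can be sketched in a few lines of Python. This is a simplified toy, not how any real linker implements ICF: functions are modelled as a body string plus a tuple of callee names, and `fold_identical` is a made-up helper that repeats the folding pass until nothing changes.

```python
def find(canonical, name):
    """Follow folding links until reaching the surviving copy."""
    while canonical[name] != name:
        name = canonical[name]
    return name


def fold_identical(functions):
    """functions: name -> (body, tuple of callee names).

    Returns name -> name of the copy it was folded into.
    """
    canonical = {name: name for name in functions}
    changed = True
    while changed:
        changed = False
        seen = {}
        for name in sorted(functions):
            if canonical[name] != name:
                continue  # already folded into another function
            body, callees = functions[name]
            # Two functions match if their bodies match and their callees
            # have already been folded to the same surviving copies.
            key = (body, tuple(find(canonical, c) for c in callees))
            if key in seen:
                canonical[name] = seen[key]
                changed = True
            else:
                seen[key] = name
    return {name: find(canonical, name) for name in functions}


FUNCTIONS = {
    "caller_a": ("call; ret", ("leaf_a",)),
    "caller_b": ("call; ret", ("leaf_b",)),
    "leaf_a": ("ret 0", ()),
    "leaf_b": ("ret 0", ()),
}
```

With this input and visiting order, the first pass can only fold `leaf_b` into `leaf_a`; `caller_b` becomes foldable into `caller_a` on the second pass, once both callers refer to the same surviving leaf.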

To me it seems sort of low-value (because you already end up in a bad place where you get told about the wrong function), but non-zero value (because better to be told consistently about the same wrong function than a mishmash of two different functions - and the wrong function issue could be reduced further through the use of call site info).

I think the only feedback I have is that the attribute should contain the offset from the start of the line table, so it doesn’t incur a relocation.

If it’s possible to use this all upstream with the gsym tooling that is in upstream LLVM, then I guess it should use an “LLVM” name, and if not, I guess it should use a “META” name. Hopefully the former. Be nice to teach llvm-symbolizer the same tricks, ideally, but I get that might be low priority.

The DW_AT_low_pc for each ICF’d function will be fixed up to point to the canonical copy of the function, so they wouldn’t point to their respective line-table fragments? Oh, but if you parsed the whole line table you could find all the fragments for that address. But still not know which went with which, yeah I see now.

Right, so the new attribute would let you disambiguate those cases, which isn’t possible today. In the cross-CU case you can partition the possibilities, but still doesn’t solve the general case.

The compiler can do some intra-CU folding, but then it still wants to end up emitting two distinct line-table fragments if we want the source disambiguation to happen.

So, the key compiler feature is to be able to identify the start of each line-table fragment, for which I ran something up the flagpole in the PR. I am pretty sure that can be expressed in a way that gets resolved at compile-time into a constant offset (.Lline_table_fragment_for_foo - .Lline_table_start) and not need a relocation. The offset would be relative to the current CU’s .debug_line header.

dsymutil knows how to insert end sequences and new start sequences as needed. This happens a lot when functions get dead stripped and the .o file had a single sequence for multiple functions. Having each function already have its own start sequence will make its life easier though. dsymutil re-writes any and everything it needs to, so it isn’t limited by the format itself.
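
To illustrate the kind of rewriting described here, the following Python sketch splits one shared sequence into per-function sequences, dropping rows for a dead-stripped function. It is a simplified model and not dsymutil's actual algorithm; `split_sequence` and its data layout are hypothetical, and it assumes every kept function has a row exactly at its start address.

```python
def split_sequence(rows, end_address, kept_functions):
    """Split a shared sequence into one sequence per surviving function.

    rows: sorted list of (address, line) from a single sequence.
    end_address: address of the original end_sequence row.
    kept_functions: list of (name, low_pc, high_pc) that survived linking.
    """
    result = {}
    for name, low_pc, high_pc in kept_functions:
        kept = [(addr, line) for addr, line in rows if low_pc <= addr < high_pc]
        if kept:
            result[name] = {
                "rows": kept,
                # A new end_sequence is inserted at the end of the function.
                "end_sequence": min(high_pc, end_address),
            }
    return result


# One sequence in the .o file covering three functions f, g and h.
SHARED_ROWS = [(0x1000, 1), (0x1008, 2), (0x1010, 5), (0x1018, 6), (0x1020, 9)]

# g, covering [0x1010, 0x1020), was dead stripped; only f and h remain.
KEPT = [("f", 0x1000, 0x1010), ("h", 0x1020, 0x1030)]
```

Calling `split_sequence(SHARED_ROWS, 0x1030, KEPT)` produces separate sequences for `f` and `h`, each with its own end_sequence address, and no rows for the stripped function.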

That works for us from a spec perspective.

Maybe the standard linker, but doesn’t LTO (mono and thin), and BOLT ignore these kinds of restrictions?

That will work. Should we encode this with a DW_FORM_data encoding to indicate it is an offset from the DW_AT_stmt_list from the CU? We could allow this to be encoded with a DW_FORM_ref_addr which would indicate an absolute offset within .debug_line, and a DW_FORM_data would indicate it is relative? I mention this because tools like dsymutil don’t need relocations after they are done, so if dsymutil or llvm-dwarfutil re-write the DWARF, they can make it either way.
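
A consumer handling both encodings floated here might resolve the attribute roughly as in this Python sketch. The form handling is only what is being discussed in this thread, not settled behaviour, and the function name and the "relative"/"absolute" labels are made up for the example.

```python
def absolute_sequence_offset(kind, value, cu_stmt_list):
    """Resolve the attribute value to an absolute offset in .debug_line.

    kind: "relative" for the DW_FORM_data-style encoding discussed above
          (offset from the CU's DW_AT_stmt_list), or "absolute" for the
          DW_FORM_ref_addr-style encoding (offset within .debug_line).
    value: the raw attribute value.
    cu_stmt_list: the CU's DW_AT_stmt_list, i.e. where its line table starts.
    """
    if kind == "relative":
        return cu_stmt_list + value
    if kind == "absolute":
        return value
    raise ValueError(f"unexpected encoding: {kind}")
```

For a CU whose line table starts at 0x200, a relative value of 0x30 and an absolute value of 0x230 would both resolve to the same sequence in this model.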

I would vote for LLVM if everyone is on board with this. I would rather have everyone’s input and make it something the world can use and use the LLVM in the name.

Thanks everyone for the feedback! Just wanted to follow up and explicitly clarify some things before jumping into the implementation for this:

  • Since this will be usable with GSYM in upstream LLVM, the name of the attr will be DW_AT_LLVM_stmt_sequence
  • The attr will be a constant offset of the form .Lline_table_fragment_for_foo - .Lline_table_start
  • The attr will only be generated if a flag is specified, and if enabled, all the subprograms in the CU will get this attr.
  • The attr will always point to the beginning of a line sequence - meaning each function will have a start and end line sequence in the line table
  • This will be implemented in the compiler (i.e. the attr will be in object files).

The implementation timeline will look roughly like:

  • RFC for assembler changes that would allow doing .Lline_table_fragment_for_foo - .Lline_table_start
  • Implement support for the attribute in the assembler - i.e. the above RFC
  • Implement generating the attribute in clang
  • Implement handling the attribute in the dwarf linker
  • Implement GSYM format extension to incorporate the new data allowed by the attribute
  • Implement intelligent GSYM lookup using callsite based disambiguation