RFC: Extending gSYM Format with Call Site Information for Merged Function Disambiguation

TL;DR: This RFC proposes extending the gSYM format with per-function call site information to help disambiguate merged functions.

Context:

In a previous RFC and PR we added support for storing merged function information in gSYM. Now we need a way to choose which function to show when looking at a call stack. This proposal is about adding the extra information we need to do this.

Problems We’re Trying to Solve

Here’s an example call stack:

Frame0: Addr00 ....
Frame1: Addr01 ....
Frame2: Addr02 ....
Frame3: Addr03 ....
Frame4: Addr04 [Merged01,Merged02,Merged03]

We need to figure out which of Merged01, Merged02, or Merged03 to show for Frame4. The easiest way is to look at who called this function. Using the return address that points into Frame3 (Addr03), we can check which function Frame3 is expected to call just before Addr03 and select among [Merged01,Merged02,Merged03] accordingly.

Scenarios we need to consider in this design:

  1. Simple case: Frame3:Addr03 is in the same binary as Frame4 and calls it directly. This is the basic case and requires no advanced logic to figure out which merged function to select - Frame3:Addr03 will directly specify that it’s calling Frame4.
  2. Calls across dylibs: Frame4 may be in one dylib and Frame3 in another. This complicates the scenario because the information for Frame3 and Frame4 will be in separate gSYM files - so we need a way of resolving merged functions when the information about which function is called and the information about the merged functions live in separate gSYMs.
  3. Calls through system libraries: Frame4 may be called through a system library (ex: dispatch_client_callout or enumerate_objects_with_options), so the information in the immediate frame is not helpful and we would need to look further up the stack. Ex: Frame2 is the actual caller, Frame3 is dispatch_client_callout, and Frame4 is the called function.
  4. Virtual function calls: In this case the target could be a list of known functions, or we might know the function name but not the class name.
  5. ObjC selector calls: Similar to #4 above, where we would know the selector name but not the class name.

Proposed solution

We want to change the gSYM format like this:

For each function, add a list of all the calls it makes. For each call, we’ll store:

  • The return address of the call (where the callee returns to)
  • Some extra flags (like if it calls a different part of the program)
  • A pattern to match the function name (RegEx)

The above should give us all the information we need to select the appropriate merged function. When walking the callstack, at each frame we can retrieve the expected name of the next call (as a regex). We can pass this regex to the resolution of the following frame - so if there is a list of merged functions for the next frame, the regex can be used to match the correct function name.
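To make the per-call record concrete, here is a minimal Python sketch of the three fields listed above and of the regex-based selection. The names `CallSite`, `select_merged`, and `FLAG_EXTERNAL_CALL` are hypothetical and used for illustration only; this is not the proposed on-disk encoding or any existing gSYM API.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical flag bit; the RFC only says that "extra flags" exist.
FLAG_EXTERNAL_CALL = 0x1

@dataclass
class CallSite:
    return_address: int   # where the callee returns to, inside the caller
    flags: int            # extra flags, e.g. FLAG_EXTERNAL_CALL
    callee_pattern: str   # regex the callee's name is expected to match

def select_merged(candidates: List[str], callee_pattern: str) -> List[str]:
    """Return the merged-function names that fully match the caller's regex."""
    rx = re.compile(callee_pattern)
    return [name for name in candidates if rx.fullmatch(name)]

# The caller's call site says the next frame should be Merged02.
site = CallSite(return_address=0x1024, flags=0, callee_pattern=r"Merged02")
merged = ["Merged01", "Merged02", "Merged03"]
print(select_merged(merged, site.callee_pattern))
```

The function returns a list rather than a single name because a pattern for a virtual call or an ObjC selector may legitimately match more than one merged function.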

If the target function is known precisely and is in the same gSYM, we can simply reuse the function name already present in the string table, so no additional storage is needed for a new string.

Backward compatibility

This will be additional information attached to a FunctionInfo - older clients that do not support this information can just ignore it.

Input data

We can input data from multiple sources. DWARF-5 has support for callsites so we can get callsite information from there. However, it only supports callsite information for direct calls. We can gather additional callsite information from the compiler for more complex scenarios (virtual functions, calls through system libraries, or ObjC selector invocations).
For the alternate input methods, as well as for testing, the callsite information can also be loaded from a .json (or text) file passed to llvm-gsymutil via the command line. This file will provide the callsites in text format, specifying the callsite return address and the rest of the information mentioned above.
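As an illustration of what such a text input could carry, the sketch below parses a small JSON document into a table keyed by return address. The schema (the keys `functions`, `callsites`, `return_address`, `flags`, `match`) is an assumption made up for this example; the RFC does not define the file layout, and no llvm-gsymutil option is implied.

```python
import json

# Hypothetical schema: one entry per caller, each with its call sites.
EXAMPLE = """
{
  "functions": [
    {
      "name": "caller",
      "callsites": [
        {"return_address": "0x1024", "flags": 0, "match": "Merged02"},
        {"return_address": "0x1040", "flags": 1, "match": ".*::draw"}
      ]
    }
  ]
}
"""

def load_callsites(text):
    """Map each return address to (caller name, flags, callee regex)."""
    doc = json.loads(text)
    table = {}
    for func in doc["functions"]:
        for cs in func["callsites"]:
            addr = int(cs["return_address"], 16)  # accepts the "0x" prefix
            table[addr] = (func["name"], cs["flags"], cs["match"])
    return table

table = load_callsites(EXAMPLE)
for addr, (caller, flags, pattern) in sorted(table.items()):
    print(f"{addr:#x}: caller={caller} flags={flags} match={pattern}")
```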


It appears that we must analyze all the stack frames sequentially due to their dependencies on preceding frames. For example, in cases where Frame3 and Frame4 contain merged functions in a sequence, it’s essential to process them in the correct order. Please take these scenarios into account:

Frame0: Addr00 ....
Frame1: Addr01 ....
Frame2: Addr02 ....
Frame3: Addr03 [Merged04, Merged05, Merged06]
Frame4: Addr04 [Merged01, Merged02, Merged03]

Regarding the incorporation of additional JSON or text files into llvm-gsymutil, I don’t have a strong preference or recommendation. However, I’m curious if we could employ user attributes to store this supplementary information under a specific flag, akin to the lightweight IRPGO approach that integrates profile map data into attributes in DWARF. This method could potentially eliminate the need to duplicate call-site information, such as offsets, in these extra JSON or text files. Also consider YAML as opposed to JSON (if you’re using a text format) as LLVM supports YAML well, AFAIK.
cc @ellishg

Yes, this would be the longer-term plan, if we can get it approved. I was thinking of using YAML / JSON so that we have an e2e testable solution for the gSYM changes. Otherwise we would have to implement the DWARF attribute before making any changes on the gSYM side, since we would need input data to test.

The DWARF-based solution would be either a new tag similar to DW_TAG_call_site or an additional attribute for DW_TAG_call_site - ex DW_AT_call_match_func:

0x00xx:     DW_TAG_call_site [10] * (0x00000073) 
                  DW_AT_call_match_func [DW_FORM_strx1]   (indexed (00000001) string = "[.* selectorName]")
                  DW_AT_call_target [DW_FORM_exprloc]	(DW_OP_reg8 W8)
                  DW_AT_call_return_pc [DW_FORM_addrx]	(indexed (00000001) address = 0x0000000000000xxx)

I am thinking that having an existing e2e scenario based on JSON/YAML would make it easier to justify the new attribute.

But I am also fine with skipping JSON and going the attribute-first route - though it may take some time until we have working e2e functionality.

Yes, this is (implicitly) taken into account - in this RFC I was focusing on the gSYM data format extension - so the resolution logic was not described in detail.
Since the callsite information is stored in the parent function (the caller), we must have already resolved the caller in order to use it - so resolving the stack frames sequentially is a necessary constraint for using the proposed data.
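A rough Python sketch of that sequential walk, for the case where two consecutive frames are both merged. Everything here is illustrative: `resolve_stack`, the frame tuples, and the `(caller name, address)` table key are assumptions, not the resolution logic that will actually be implemented. Frames are listed caller first, matching the numbering used in this thread.

```python
import re

def resolve_stack(frames, callsites):
    """Resolve merged candidates frame by frame, outermost caller first.

    frames:    list of (address_in_frame, candidate_names)
    callsites: {(caller_name, address_in_caller): callee_regex}
    """
    resolved = []
    pattern = None  # expectation handed down by the previous (caller) frame
    for addr, candidates in frames:
        matches = candidates
        if pattern is not None and len(candidates) > 1:
            filtered = [n for n in candidates if re.fullmatch(pattern, n)]
            if filtered:  # keep all candidates if nothing matches
                matches = filtered
        resolved.append(matches)
        # Only an unambiguous caller can tell us what the next frame should be.
        pattern = callsites.get((matches[0], addr)) if len(matches) == 1 else None
    return resolved

frames = [
    (0x100, ["main"]),
    (0x200, ["Merged04", "Merged05", "Merged06"]),
    (0x300, ["Merged01", "Merged02", "Merged03"]),
]
callsites = {
    ("main", 0x100): "Merged05",
    ("Merged05", 0x200): "Merged01",
}
print(resolve_stack(frames, callsites))
```

If a merged frame cannot be narrowed to one function, this sketch stops propagating expectations, so later frames keep their full candidate lists.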

PS: I realized that in my examples I flipped the callstack / frame numbering from how it is normally presented.

In my examples Frame0 is the entry point to the program (ex:main) and Frame4 would be the current PC, ex: the address where a crash happened.

Given that the use of gsym is less common than dsym, I agree that starting with a text-based input to supply additional information is acceptable for now, until the full end-to-end process is operational.

Regarding the display of stack frames where merging occurs, I am curious about how you handle them:

  • By default, without this supplementary input for merge resolution, I believe we should offer two forms: a concise form that simply displays a leading/function symbol along with the number of merging instances (indicating that this function is merged), and a verbose option that displays the list of all functions as you illustrated.

  • When a function is resolved to a specific function with the supplementary input, I think we should still tag the number of merging instances to the frame. What if we identify a set of functions that match a regular expression from this supplementary data? Should we then display the list of these filtered functions while providing the original number of merging instances?

The details of how the tooling would work will be finalized in an upcoming RFC - specifically about tooling behavior - based on everyone’s input, but here is what I currently have in mind:

  • We may consider not changing the default behavior at all (is anyone relying on the current output matching a specific format that is to be parsed?)
  • We would provide two output options that users of this info can use - basically the concise / verbose options you indicate.
  • Yes, we should also display information for matching functions, including if several functions match the regex.
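To make the concise / verbose idea tangible, here is one possible rendering as a Python sketch. The output format, the function name `format_frame`, and the `(+N merged)` suffix are invented for illustration; the actual tool output is to be decided in the follow-up RFC.

```python
def format_frame(index, addr, names, verbose=False):
    """Render one stack frame; `names` holds all (possibly merged) candidates."""
    if len(names) <= 1:
        label = names[0] if names else "<unknown>"
    elif verbose:
        # Verbose form: list every merged candidate.
        label = "[" + ", ".join(names) + "]"
    else:
        # Concise form: leading symbol plus the number of other merged functions.
        label = f"{names[0]} (+{len(names) - 1} merged)"
    return f"Frame{index}: {addr:#x} {label}"

merged = ["Merged01", "Merged02", "Merged03"]
print(format_frame(4, 0x4000, merged))
print(format_frame(4, 0x4000, merged, verbose=True))
```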

The above flags are not necessary for implementing the embedding of callsite info. For now I would rely on verbose dumping of the gsym to verify all is in order; the verbose dump will display all information.

@clayborg Any comment on this direction? Who else should review this further?

Sounds like a good starting point. I think the YAML would be fine to start to make our initial design and then we need to prove that this works and is useful and doesn’t add too much data to the GSYM file. One question I have: will this new call site information be included for every single call in any function? Is it possible to know when a call might happen to a merged function? I fear this information will take up too much room if every call site has some information that is required.

The lookup APIs can be changed to take a calling function name so that we can give back the right result, or we can fall back to the default information we were showing before. We can probably make a new API that will return an array of LookupResults for when we want all merged functions to show, and then add a “--all” option to the lookup options for the command line tool that will display them all.
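The shape of such an API pair might look like the following Python sketch. It is a toy model only: `SYMBOLS`, `lookup`, `lookup_all`, and the `name_hint` parameter are hypothetical, and the hint is modeled here as a regex taken from the caller's call site, which is an assumption layered on top of the suggestion above.

```python
import re
from typing import Dict, List, Optional

# Toy stand-in for a gSYM address table: address -> merged function names.
SYMBOLS: Dict[int, List[str]] = {0x4000: ["Merged01", "Merged02", "Merged03"]}

def lookup_all(addr: int) -> List[str]:
    """Return every function recorded at this address (the 'show all' case)."""
    return list(SYMBOLS.get(addr, []))

def lookup(addr: int, name_hint: Optional[str] = None) -> Optional[str]:
    """Return one name; use the caller-provided hint when it matches."""
    names = lookup_all(addr)
    if not names:
        return None
    if name_hint is not None:
        for name in names:
            if re.fullmatch(name_hint, name):
                return name
    return names[0]  # no hint or no match: same answer as before

print(lookup(0x4000))
print(lookup(0x4000, name_hint="Merged03"))
print(lookup_all(0x4000))
```

Keeping the hint optional means existing callers that pass only an address get the same result they did before.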


For DWARF-5 callsite data, we can keep only the calls that target merged functions. For data from other sources that use name matching, we don’t know the exact target so we might have to keep all of them, though we perhaps could do some filtering.
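A small Python sketch of that pruning rule, under the assumption that each call site record carries either an exact `target` name (the DWARF-5 direct-call case) or only a `match` regex. The record layout and the name `prune_callsites` are hypothetical.

```python
def prune_callsites(callsites, merged_names):
    """Keep only the call sites that may be needed for disambiguation.

    callsites:    list of dicts with 'target' (exact name or None) and 'match'
    merged_names: set of function names known to be merged
    """
    kept = []
    for cs in callsites:
        target = cs.get("target")
        if target is None:
            kept.append(cs)  # name-matching source: target unknown, keep it
        elif target in merged_names:
            kept.append(cs)  # direct call to a merged function
        # direct calls to non-merged functions are dropped
    return kept

callsites = [
    {"return_address": 0x10, "target": "Merged01", "match": None},
    {"return_address": 0x20, "target": "plainFunc", "match": None},
    {"return_address": 0x30, "target": None, "match": ".*::draw"},
]
kept = prune_callsites(callsites, {"Merged01", "Merged02"})
print([hex(cs["return_address"]) for cs in kept])
```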