Searching for GSYM documentation

Hey folks, a colleague mentioned that the GSYM data format in LLVM might be useful for some symbolization applications we have internally. I went looking for documentation on GSYM (what the structure is, how to use it, what it’s for), but the only things I could find were the README.md file in the original Phabricator review from 2018 and the source code itself.

There is a recent RFC with active contributions from @clayborg, @alx32 and @kyulee-com relating to new DWARF call site information, so clearly somebody is using it for something.

At a high level, my understanding of GSYM is that it’s basically an indexed DWARF format. It’s the data format that your online addr2line symbolization tool needs to process crash dumps, rather than the raw DWARF, which is more like a collection of records that roughly maps back to source code constructs. Is that reasonably accurate? Ideally, any more information on what this format does, how to use it, etc, could be written up and added to llvm-project/llvm/docs/GSYM.md.

Hey,

Somewhat. It is an indexed debug info format meant for efficient symbolication: fast loading into memory, fast lookups, and only the minimal information needed to support symbolication. DWARF is more comprehensive than that and contains all possible debug information, e.g. local variable names, locations in memory of local variables, etc.

That is just a summary. Was there anything in particular you were looking for?

Here is also an AI summary that I’ve manually verified to be correct (except for the DWARF internals things that I’m not that familiar with):

Overview of GSYM

GSYM is a compact, index-oriented debugging symbol format developed under the LLVM project. It was originally introduced to provide a lightweight way to symbolize stack traces—especially for production or post-mortem scenarios where you only need basic function/line information rather than the full richness (and overhead) of a traditional debug format like DWARF. GSYM is used by tools such as LLDB and llvm-symbolizer as an alternative or a supplement to DWARF.

The main goal of GSYM is to store just enough information to map instruction addresses back to function names and line numbers in the source code. It is designed to be:

  • Small in size – by omitting most of the information that full debuggers need (e.g., type information, variable scopes).
  • Efficient to load – it can be read quickly at runtime with random access patterns (important for large programs or profiling use cases).
  • Simple in structure – to keep the implementation understandable, reduce overhead, and allow easy caching or distribution of symbol information.

Key Differences Compared to DWARF

1. Scope and Complexity

  • DWARF: A very feature-rich, comprehensive debug format that supports everything from line tables to complex type information, inline function call details, variable scopes, lexical blocks, template parameter expansions, and more.
  • GSYM: Much more minimal, focusing on mapping instruction addresses to function symbols and line numbers. It does not encode complex type information, variable layouts, or other detailed metadata.

2. File Size and Storage

  • DWARF: Tends to be large due to the wealth of data it contains; full DWARF can rival or exceed the size of the executable itself.
  • GSYM: Designed to be small by storing only essential symbolization data (function boundaries and line tables) for quick backtraces and line lookups.

3. Read/Access Patterns

  • DWARF: Designed for a wide variety of debugging use cases, involving scanning sections (e.g., .debug_info, .debug_line) to reconstruct a program’s structure and metadata. It is highly expressive but more complex to parse on the fly.
  • GSYM: Optimized for fast lookup of symbols and line information using an index-based layout that allows random access, making it straightforward for mapping an address to a function or line.
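
As a rough illustration of the index-based lookup described above, here is a minimal Python sketch. It is a toy model, not the GSYM on-disk layout or any LLVM API: the `FunctionEntry` type, the sample addresses, and `lookup_function` are all hypothetical. It assumes function start addresses are kept sorted so that a binary search can find the containing function without scanning every record.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class FunctionEntry(NamedTuple):
    """Hypothetical index entry: one function's address range and name."""
    start: int
    size: int
    name: str


# Made-up sample data, sorted by start address (an assumption of this sketch).
FUNCTIONS = [
    FunctionEntry(0x1000, 0x40, "main"),
    FunctionEntry(0x1040, 0x20, "parse_args"),
    FunctionEntry(0x1100, 0x80, "run"),
]
STARTS = [f.start for f in FUNCTIONS]


def lookup_function(addr: int) -> Optional[FunctionEntry]:
    """Return the function containing addr, or None if no function covers it."""
    # Index of the last entry whose start address is <= addr.
    i = bisect_right(STARTS, addr) - 1
    if i < 0:
        return None  # addr is below the first function
    entry = FUNCTIONS[i]
    if addr >= entry.start + entry.size:
        return None  # addr falls in a gap between functions
    return entry
```

Each lookup costs O(log n) comparisons over the sorted table, which is the property that makes random-access symbolication cheap compared with scanning records sequentially.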

4. Supported Information

  • DWARF: Provides virtually all the information a debugger needs, including type definitions, class hierarchies, template expansions, inline call sites, local variables, function parameters, call frame information, and location expressions.

  • GSYM: Stores a limited set of data:

    • A list of address ranges for each function.
    • The function’s name.
    • Line table information mapping addresses to source file line numbers.
    • Basic file path references where necessary.

    It does not include deeper scope or type information.
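
To make that list concrete, here is a hedged Python sketch of what one per-function record could look like and how an address inside it might be resolved to a file and line. All names here (`LineRow`, `FunctionRecord`, `FILES`, `resolve`, `EXAMPLE`) and the sample values are hypothetical and chosen only for illustration; the real encoding lives in the LLVM source.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LineRow:
    """Hypothetical line table row: applies from addr until the next row."""
    addr: int
    file_index: int  # index into the shared FILES table
    line: int


@dataclass
class FunctionRecord:
    """Hypothetical record holding the four kinds of data listed above."""
    ranges: List[Tuple[int, int]]  # half-open [start, end) address ranges
    name: str
    line_rows: List[LineRow]       # assumed sorted by addr


# Shared file path table, referenced by index so paths are stored once.
FILES = ["src/main.c", "src/util.h"]

EXAMPLE = FunctionRecord(
    ranges=[(0x2000, 0x2030)],
    name="compute",
    line_rows=[
        LineRow(0x2000, 0, 10),
        LineRow(0x2010, 0, 11),
        LineRow(0x2020, 1, 42),
    ],
)


def resolve(rec: FunctionRecord, addr: int) -> Optional[str]:
    """Format 'name at file:line' for addr, or None if addr is outside rec."""
    if not any(lo <= addr < hi for lo, hi in rec.ranges):
        return None
    # Find the last line row whose address is <= addr.
    addrs = [row.addr for row in rec.line_rows]
    i = bisect_right(addrs, addr) - 1
    if i < 0:
        return rec.name  # inside the function, but no line info for addr
    row = rec.line_rows[i]
    return f"{rec.name} at {FILES[row.file_index]}:{row.line}"
```

For example, `resolve(EXAMPLE, 0x2018)` falls in the row that starts at `0x2010` and so reports line 11 of `src/main.c`. Storing file paths in a shared table and referring to them by index is one simple way to keep such records small.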

5. Typical Use Cases

  • DWARF: The default choice for full debugging sessions with capabilities like stepping through code, setting breakpoints, inspecting local variables, and more.
  • GSYM: Ideal for scenarios where only the symbolization of stack traces is required (e.g., crash reports, performance profiling) and where keeping the symbol data small is important.

6. Availability and Integration

  • DWARF: Has been the standard for decades, widely supported by major compilers and debuggers.
  • GSYM: A newer addition to LLVM, integrated into the LLVM toolchain (e.g., through the llvm-gsymutil tool) and supported by LLDB for symbolization. It is gaining traction in contexts where lightweight symbol information is sufficient.

Summary

GSYM is a lightweight symbol format aimed at quick lookup of function boundaries and line information in symbolic backtraces. It differs from DWARF by storing only the essential information needed for address-to-line/name mapping, which results in a simpler, smaller, and faster-to-load structure. In contrast, DWARF provides a comprehensive suite of debugging data (including full type information, variable scopes, and more), making it indispensable for full-fledged debugging sessions.

For scenarios where you need to perform detailed interactive debugging, DWARF remains the necessary choice. However, if your goal is to efficiently convert raw program counters into human-readable stack traces (especially in production or profiling environments), GSYM offers a compelling alternative.

Thanks! That pretty much confirms my understanding. Two things though:

  1. Can you please send a PR to document GSYM? Even just taking the AI-generated post as a starting point would be good: checking it in blesses it as correct and makes it part of the training set for future AI-driven search queries.
  2. Can you confirm whether or not GSYM tracks inlined call frames? The generated post says it doesn’t, but I find that pretty surprising. For our internal profiling and profile-guided optimization (PGO) applications, we’ve found that inlined call frames are critical both to human understanding of application performance and to the optimization itself. Adding this capability could help reduce the overhead of PGO/FDO tech.