[RFC] Multi-level line table support in LLVM

Motivation

The current AI compiler stack is growing increasingly complex, utilizing multiple intermediate representations before reaching machine code. For instance, a modern AI kernel written in a high-level Python or C++ DSL (such as Triton or CuTile) typically traverses a deep pipeline: DSL Source → TileIR/MLIR → LLVM IR → … → Assembly. As these compilation stacks deepen, kernel writers and end users require deeper insight to extract maximum performance and pinpoint bottlenecks at each specific layer. Consequently, there is an urgent need for profiling tools capable of mapping machine instructions not just to high-level source code, but also back to specific intermediate representations. With this in mind we propose modifying DebugLoc and relevant parts of LLVM to carry additional line information, without associated full scope information.

Without this capability:

  • Power users and compiler developers cannot correlate hot spots to the IR layer they are interested in. The profiler can show the DSL source line, but cannot map the bottleneck back to the corresponding construct in the intermediate representation, a gap that widens as DSL-to-GPU compilation stacks grow deeper.

Summary

We propose extending LLVM’s DebugLoc to carry multiple source locations per instruction: a primary location mapping to the high-level source, and one or more additional locations mapping to intermediate representations in the compilation pipeline. The extension modifies DebugLoc to allow it to reference an MDTuple instead of a single DILocation, making it transparent to passes that only care about the primary location while enabling profilers to build separate line tables for each IR level. A reference implementation targeting the NVPTX backend demonstrates the full pipeline: IR representation, bitcode round-trip, location merging, and inlining support.

Additionally, we introduce a mechanism for capturing intermediate source code in memory and embedding it directly within the final ELF file.

Reference implementation: [DRAFT][LLVM][DEBUG] Reference implementation of multi-level line table support by ayermolo · Pull Request #205453 · llvm/llvm-project · GitHub

Design

At a high level, the design extends DebugLoc to optionally carry additional intermediate locations alongside the primary one.

The core of this design involves widening the storage of DebugLoc from a specific DILocation* to a generic MDNode*. While a standard DebugLoc typically wraps a single DILocation, this transition enables the slot to hold either a bare location - preserving zero overhead for the common case - or an MDTuple that pairs the primary source mapping with one or more additional intermediate locations:

!dbg = !{ DILocation primary,
         !{ MDString kind, DILocation intermediateLoc },   ; layer 1 (e.g. "MyMidIR")
         !{ MDString kind, DILocation intermediateLoc },   ; layer 2 (optional)
         ... }

Concrete example:

!20 = !{!11, !17}                          ; composite location
!11 = !DILocation(line: 2, column: 5, scope: !8)   ; primary: source.py:2:5
!17 = !{!"TileIR", !16}                    ; intermediate tag
!16 = !DILocation(line: 100, column: 10, scope: !15) ; intermediate: tileir:100:10

Operand 0 is always the primary DILocation; every later operand is an inner !{ kind, loc } sub-tuple, with the kind string naming the IR layer. The intermediate DILocation needs only a minimal scope, a DILexicalBlockFile (more on this below), enough to point at its DIFile and tie it to the compile unit, without copying the full source scope chain. DebugLoc hides the tuple from callers: get() returns the primary location, and separate accessors reach the intermediate ones. Existing passes keep working unchanged: merging combines the operands in parallel, equality compares both levels, bitcode round-trips the tuple, and the verifier checks its shape.
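
To make the accessor behaviour concrete, here is a minimal standalone sketch. It does not use LLVM headers: Loc, CompositeLoc, DebugLocModel and getIntermediate() are hypothetical stand-ins for DILocation, the composite MDTuple, DebugLoc and the "separate accessors" mentioned above, whose real names the text does not give.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Stand-in for llvm::DILocation; only the fields this sketch needs.
struct Loc {
  unsigned line;
  unsigned column;
  std::string file;
};

// Stand-in for the composite MDTuple: operand 0 is the primary location,
// every later operand is a {kind, location} pair.
struct CompositeLoc {
  Loc primary;
  std::vector<std::pair<std::string, Loc>> intermediates;
};

// Toy counterpart of DebugLoc: empty, a bare location, or a composite.
class DebugLocModel {
  std::variant<std::monostate, Loc, CompositeLoc> storage;

public:
  DebugLocModel() = default;
  explicit DebugLocModel(Loc loc) : storage(std::move(loc)) {}
  explicit DebugLocModel(CompositeLoc composite) {
    // A composite without intermediates should be stored as a bare Loc.
    assert(!composite.intermediates.empty() && "use a bare Loc instead");
    storage = std::move(composite);
  }

  // Returns the primary location whether or not a tuple is present,
  // so callers that only know about one location keep working.
  const Loc *get() const {
    if (const Loc *loc = std::get_if<Loc>(&storage))
      return loc;
    if (const CompositeLoc *c = std::get_if<CompositeLoc>(&storage))
      return &c->primary;
    return nullptr;
  }

  // Hypothetical accessor: looks up the intermediate location for one layer.
  const Loc *getIntermediate(const std::string &kind) const {
    const CompositeLoc *c = std::get_if<CompositeLoc>(&storage);
    if (!c)
      return nullptr;
    for (const auto &entry : c->intermediates)
      if (entry.first == kind)
        return &entry.second;
    return nullptr;
  }
};
```

In this model a pass calling get() sees source.py:2:5 for both the bare and the composite form, while a profiler asking for "TileIR" gets the intermediate position only when one was attached.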

The intermediate IR text is stored separately, in a module-level named-metadata list whose entries have the form !{ kind, filename, source_text }; the DIFile in the intermediate location links an instruction to it. At emission, the backend writes the usual primary line directive followed by one secondary directive per intermediate tuple.
On the object side, the assembler lowers those secondary directives into a separate section in DWARF .debug_line format. The IR text goes into sections named by a hash of their contents, so the linker can fold identical bodies together via COMDAT, with the DWARF file descriptor pointing at them. Tools can then map a machine address to both the source line and the IR line, reading the IR text straight from the binary.

The framework is target-agnostic: the IR representation, verifier, and merging logic are backend-independent. Any target that emits DWARF line tables can opt in by emitting the secondary directives and the intermediate-source section, reusing the standard object and linker machinery.

Consumer impact. Because the additional line tables use the standard DWARF .debug_line format, existing tools need almost no changes. A consumer reuses its current .debug_line parser unchanged. It only has to recognize the new section names and follow the file entry to the embedded IR text. A profiler then maps an address to an IR-layer line the same way it already maps one to a source line, with no new parsing logic and no DIE-tree walking. The section layout and file-table referencing are covered in Binary Representation and Line Tables.

Verifier

At the minimum the !dbg metadata check is relaxed to accept either a DILocation or an MDTuple whose first operand is a DILocation:

CheckDI(isa<DILocation>(N) ||
        (isa<MDTuple>(N) && N->getNumOperands() > 0 &&
         isa<DILocation>(N->getOperand(0))),
        "invalid !dbg metadata attachment", &I, N);

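The check above only inspects operand 0. A stricter shape check, covering the inner {kind, loc} sub-tuples described in the Design section, could look roughly like the following. This is a standalone sketch over a toy Node type, not LLVM's Metadata classes; Node, isIntermediateEntry and isValidDbgAttachment are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for metadata nodes; not the real LLVM class hierarchy.
struct Node {
  enum Kind { Location, String, Tuple } kind;
  std::vector<const Node *> operands; // used only by Tuple nodes
};

// True if n is an inner {kind string, location} pair.
bool isIntermediateEntry(const Node *n) {
  return n && n->kind == Node::Tuple && n->operands.size() == 2 &&
         n->operands[0] && n->operands[0]->kind == Node::String &&
         n->operands[1] && n->operands[1]->kind == Node::Location;
}

// Accepts a bare location, or a tuple whose first operand is a location
// and whose remaining operands are all {kind, location} pairs.
bool isValidDbgAttachment(const Node *n) {
  if (!n)
    return false;
  if (n->kind == Node::Location)
    return true;
  if (n->kind != Node::Tuple || n->operands.empty())
    return false;
  if (!n->operands[0] || n->operands[0]->kind != Node::Location)
    return false;
  for (std::size_t i = 1; i < n->operands.size(); ++i)
    if (!isIntermediateEntry(n->operands[i]))
      return false;
  return true;
}
```

A tuple holding only a primary location still passes here, matching the minimal check; whether the real verifier should reject that form is left open by the text.
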
Intermediate Location Scope

Intermediate debug locations do not carry a full scope chain parallel to the source language's. They only need to capture enough information to reference the intermediate source file, which exists only in memory, and tie it back to the compile unit. The reference implementation achieves this by using DILexicalBlockFile to store the virtual intermediate filename, with the parent scope pointing to the enclosing DISubprogram:

!15 = !DIFile(filename: "tileIR_source.123", directory: ".")
!16 = !DILexicalBlockFile(scope: !8, file: !15, discriminator: 0)
!17 = !DILocation(line: 100, column: 10, scope: !16)

Here !8 is the DISubprogram of the function being compiled. The DILexicalBlockFile provides the file association without introducing a new subprogram or lexical block hierarchy for the intermediate representation.

This works but is a pragmatic reuse of an existing node type whose primary purpose (discriminator-based scope splitting for the sample profiler) is unrelated. For clarity and to avoid overloading DILexicalBlockFile semantics, a dedicated class extending DILocalScope (e.g., DIIntermediateScope) could be introduced. Such a class would explicitly model the “virtual file reference within an existing subprogram” concept and make the intent self-documenting in both the IR and the verifier.

Location Merging

DebugLoc::getMergedLocation() is responsible for reconciling locations when instructions are fused during transformations like instruction combining, CSE, or hoisting. The design extends this logic to handle composite locations by merging each channel independently: the primary source mapping and any intermediate locations are reconciled in parallel using the existing merging heuristics.

The merging process follows these steps:

  1. Handle empty operands. If one input lacks location data, the other is preserved in its entirety, including all intermediate metadata. This ensures that merging with unlocated code does not strip away layer-specific information.
  2. Primary location reconciliation. Standard DILocation::getMergedLocation() rules apply to the first operand: identical sites are preserved, while differing positions within a shared scope collapse line/column data toward zero. Divergent function scopes result in a null location.
  3. Intermediate layer merging. Every intermediate location is processed using the same algorithm, isolated from the primary channel. If a specific layer is present in both inputs, they are merged; if only one side provides it, that mapping is carried forward.
  4. Tuple reconstruction. If any intermediate mapping remains valid after the merge, the result is re-wrapped into a !{primary, {kind, intermediate}} tuple. Otherwise, it returns a bare DILocation, maintaining zero-cost for standard compilation paths. These composite nodes are uniqued to prevent metadata bloat.

Crucially, these channels degrade independently. A merge might preserve an exact primary mapping while the intermediate IR location collapses (e.g., merging two instructions from the same source line but different IR constructs), or vice versa (e.g., the same IR instruction appearing in different inline contexts). Since the results are not coupled, the intermediate metadata must be explicitly merged rather than derived from the primary source mapping.

For multi-layer stacks, channels are paired based on their kind tag. Layers common to both inputs are merged pairwise, while unique layers are preserved. If a merge results in an empty location for a specific kind, that layer is dropped from the final composite tuple while others are retained.
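
The per-channel merge can be sketched as follows. This is a simplified standalone model, not the reference implementation: Loc, Composite, mergeOne and mergeLocations are hypothetical names, a scope is reduced to a string, and mergeOne only approximates DILocation::getMergedLocation() (identical stays, same scope collapses to line 0, different scope yields nothing). How the real tuple is formed when the primary collapses to null but a layer survives is not specified above, so the model simply keeps the primary optional.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

struct Loc {
  unsigned line;
  unsigned column;
  std::string scope;
  bool operator==(const Loc &o) const {
    return line == o.line && column == o.column && scope == o.scope;
  }
};

struct Composite {
  std::optional<Loc> primary;
  std::map<std::string, Loc> layers; // keyed by kind tag, e.g. "TileIR"
};

// Simplified single-channel merge.
std::optional<Loc> mergeOne(const std::optional<Loc> &a,
                            const std::optional<Loc> &b) {
  if (!a || !b)
    return std::nullopt;
  if (*a == *b)
    return a;                      // identical sites are preserved
  if (a->scope == b->scope)
    return Loc{0, 0, a->scope};    // shared scope: collapse line/column
  return std::nullopt;             // divergent scopes: no location
}

std::optional<Composite> mergeLocations(const std::optional<Composite> &a,
                                        const std::optional<Composite> &b) {
  if (!a) return b;                // step 1: keep the located side whole
  if (!b) return a;
  Composite out;
  out.primary = mergeOne(a->primary, b->primary);        // step 2
  for (const auto &[kind, loc] : a->layers) {            // step 3
    auto it = b->layers.find(kind);
    if (it == b->layers.end()) {
      out.layers.emplace(kind, loc);                     // only in a: carry
    } else if (auto merged = mergeOne(loc, it->second)) {
      out.layers.emplace(kind, *merged);                 // in both: merge
    }                                                    // empty result: drop
  }
  for (const auto &[kind, loc] : b->layers)
    if (!a->layers.count(kind))
      out.layers.emplace(kind, loc);                     // only in b: carry
  return out;
}
```

Merging two instructions from source.py:2:5 that came from different TileIR lines keeps the primary exact while the "TileIR" layer collapses to line 0, which is the independent degradation described above.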

Inlining

When a function is inlined the primary DILocation keeps its own line and column but gains an inlined-at link back to the call site, preserving the source-level call stack. Callee instructions with no debug info are given the call site’s location (both primary and intermediate), attributing them to the call.

The intermediate location is carried through unchanged. It gets no inlined-at chain. An intermediate location only needs to identify a position in the IR layer, so it deliberately carries no inline scope; the source-level inline context is already recorded on the primary. As a result the intermediate line table doesn’t encode LLVM-level inlining: the same IR construct inlined at several sites maps to the same IR-layer location. Inline disambiguation lives in the primary line table, while the intermediate table just maps back to the IR text.

; Underlying nodes (before inlining):
src_callee = DILocation(line: 7,  col: 3, scope: @callee)      ; primary
mid_callee = DILocation(line: 42, col: 1, scope: MidIR_file)   ; intermediate
src_call   = DILocation(line: 20, col: 5, scope: @caller)      ; the call site

; Callee body instruction, before inlining:
%x = add …, !dbg !{ src_callee, {"TileIR", mid_callee} }

; Same instruction, after inlining into the caller:
%x = add …, !dbg !{
    DILocation(line: 7, col: 3, scope: @callee, inlinedAt: src_call), ; primary
    {"TileIR", mid_callee}                                          ; intermediate
}
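
The same rule can be written as a small standalone model. The names here (Loc, Composite, withInlinedAt, inlineLocation) are hypothetical and the scope is reduced to a string; the sketch only shows that the inlined-at link is added to the primary while intermediates are copied untouched, and that an unlocated callee instruction inherits the call site's full location.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Loc {
  unsigned line;
  unsigned column;
  std::string scope;
  std::shared_ptr<const Loc> inlinedAt; // empty when not inlined
};

struct Composite {
  Loc primary;
  std::vector<std::pair<std::string, Loc>> intermediates;
};

// Appends callSite to the end of loc's inlined-at chain.
Loc withInlinedAt(const Loc &loc, const Loc &callSite) {
  Loc out = loc;
  if (loc.inlinedAt)
    out.inlinedAt =
        std::make_shared<Loc>(withInlinedAt(*loc.inlinedAt, callSite));
  else
    out.inlinedAt = std::make_shared<Loc>(callSite);
  return out;
}

// Location of a callee instruction after it is inlined at callSite.
Composite inlineLocation(const std::optional<Composite> &calleeInst,
                         const Composite &callSite) {
  if (!calleeInst)
    return callSite; // no debug info: take the call site, all layers included
  Composite out = *calleeInst;                 // intermediates copied as-is
  out.primary = withInlinedAt(calleeInst->primary, callSite.primary);
  return out;
}
```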

Bitcode Serialization

The intermediate location is serialized as a second FUNC_CODE_DEBUG_LOC record immediately following the primary one. The record carries an 8th field containing the MDString ID which the reader uses to reconstruct the !{kind, DILocation} tuple via appendIntermediateDebugLoc().

The ValueEnumerator is updated to enumerate both the intermediate DILocation operands and the kind MDString so they are available during bitcode emission.

Source Code Embedding

Intermediate IR source code—which does not exist as an on-disk file—is stored in !llvm.intermediate_level_source module-level named metadata:

!llvm.intermediate_level_source = !{!30, !31}
!30 = !{!"TileIR", !15, !"entry @add_kernel() { ... }"}
!31 = !{!"TileIR", !18, !"entry @other_kernel() { ... }"}
...
!15 = !DIFile(filename: "tileIR_source.123", directory: ".")
!16 = !DILexicalBlockFile(scope: !8, file: !15, discriminator: 0)
!17 = !DILocation(line: 100, column: 10, scope: !16)

Binary Representation and Line Tables

Both the primary source mapping and every intermediate IR layer utilize the standard DWARF .debug_line encoding. While the primary table correlates machine-code addresses to high-level source coordinates, each additional layer contributes a parallel .debug_line program within its own section, mapping those same addresses to intermediate positions. This reuse of established formats ensures composability: the linker can treat the new sections just like .debug_line. Consequently, existing DWARF consumers can parse these additional layers without modification, requiring only recognition of the new section nomenclature.

Every .debug_line program incorporates a file table in its header, indexed by the file register in each matrix row. In the primary table, these entries reference on-disk source files. For intermediate layers where source text exists only in memory, the file-table entries resolve to dedicated ELF sections that embed the IR text directly. These sections utilize a specific naming convention. For example:

.debug_txt.<IR_lang>.<hash>

Within this scheme, <IR_lang> identifies the specific layer (e.g., tileir, mlir) and <hash> represents the MD5 sum of the IR text. Consumers navigating the secondary line-number program follow the file descriptor to the corresponding .debug_txt section, allowing them to fetch the embedded IR body without consulting external files.

When multiple translation units embed identical IR bodies, they generate identically named text sections, each residing in a COMDAT group (SHT_GROUP) keyed by the hash signature. The linker retains a single instance per signature, collapsing redundant IR text into one copy in the final image.
A representative layout for a module with one intermediate layer follows:

.debug_line                  ; primary: PC -> source.py:line:col
.debug_line.tileir           ; intermediate: PC -> tileir:line:col
.debug_txt.tileir.<hash>     ; embedded TileIR body (deduplicated)

Under this architecture, a profiler can resolve a machine-code address to both the high-level source and any intermediate IR level. By consulting the primary and secondary line tables, the tool fetches the deduplicated IR text directly from the binary, providing a self-contained representation of the entire compilation stack.
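
A consumer-side lookup along these lines might look like the sketch below. It is a standalone model with already-decoded rows rather than a DWARF parser, and it makes two assumptions the text does not confirm: that an intermediate table's file entry holds the name of the .debug_txt section directly, and that sequence boundaries can be ignored. Row, textSectionName, linkTextSections, lookup and irTextAt are hypothetical names; the hash is passed in rather than computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One decoded line-table row. For intermediate tables this sketch assumes
// the file entry is the name of the text section holding the IR body.
struct Row {
  std::uint64_t address;
  unsigned line;
  unsigned column;
  std::string file;
};
using LineTable = std::vector<Row>; // rows sorted by address

// Builds ".debug_txt.<IR_lang>.<hash>"; the hash is computed elsewhere.
std::string textSectionName(const std::string &lang, const std::string &hash) {
  return ".debug_txt." + lang + "." + hash;
}

// Models COMDAT folding: sections with the same name keep a single copy.
std::map<std::string, std::string> linkTextSections(
    const std::vector<std::map<std::string, std::string>> &units) {
  std::map<std::string, std::string> image;
  for (const auto &unit : units)
    for (const auto &[name, body] : unit)
      image.emplace(name, body); // emplace leaves an existing entry untouched
  return image;
}

// Returns the last row at or below pc, or nullptr if pc precedes the table.
const Row *lookup(const LineTable &table, std::uint64_t pc) {
  auto it = std::upper_bound(
      table.begin(), table.end(), pc,
      [](std::uint64_t value, const Row &row) { return value < row.address; });
  return it == table.begin() ? nullptr : &*std::prev(it);
}

// Maps pc to the line of IR text it came from (lines are 1-based).
std::optional<std::string> irTextAt(
    const LineTable &table, const std::map<std::string, std::string> &image,
    std::uint64_t pc) {
  const Row *row = lookup(table, pc);
  if (!row)
    return std::nullopt;
  auto section = image.find(row->file);
  if (section == image.end())
    return std::nullopt;
  std::istringstream in(section->second);
  std::string text;
  for (unsigned n = 1; std::getline(in, text); ++n)
    if (n == row->line)
      return text;
  return std::nullopt;
}
```

The primary table is queried with the same lookup() call, which is the point of reusing the .debug_line format: only the section name and the file-entry resolution differ between layers.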

Impact on compiler performance

On the default path – where no intermediate location is attached – the DebugLoc stores a plain DILocation pointer, exactly as it does today. The only added cost is a single dyn_cast<DILocation> check in DebugLoc::get(), which succeeds on the first branch and short-circuits immediately.

The cost of the MDTuple representation is incurred only when an intermediate location is actually accessed, and only at the point where DebugLoc is dereferenced to extract the DILocation—not on every construction or assignment. In practice this means the feature is pay-as-you-go: compilation pipelines that do not attach intermediate locations see no measurable regression.

Metadata and IR memory footprint. In scenarios where intermediate mappings are utilized, the overhead is primarily confined to metadata expansion rather than computational complexity. Instructions referencing secondary locations swap a direct DILocation pointer for an MDTuple container—encompassing the {primary, {kind, intermediate}} structure and associated sub-tuples. Supporting entities, including the MDString tags, DIFile references, and DILexicalBlockFile scopes, are uniqued at the module level. This ensures they are allocated once per function and layer, effectively amortizing the cost across all instructions. Furthermore, instructions sharing an IR-layer coordinate reference the same uniqued DILocation node. The resulting marginal growth is limited to the wrapper tuples and a supplementary debug-location record in bitcode, while unannotated modules remain entirely unchanged.

Source Embedding. While the intermediate IR text represents the most significant size contribution when active, it is stored as uniqued MDStrings within a module-level named metadata list. This separation ensures that the source text is not processed by individual passes, preserving compiler throughput. The memory footprint is strictly bounded by the volume of unique IR code. On the binary side, this text is subject to COMDAT-based deduplication across translation units, as detailed in Binary Representation and Line Tables, ensuring that redundant IR bodies do not inflate the final object size.

Benefits

  1. Zero-cost when unused. If no intermediate location is attached, DebugLoc behaves identically to the current implementation: it holds a plain DILocation pointer. The overhead of the MDTuple representation is incurred only when an intermediate location is present, and only at dereference time.

  2. Leverages established infrastructure. The implementation maintains the composite location within DebugLoc’s original storage as a raw, untracked MDNode*, avoiding the need for specialized tracking logic. This approach remains robust because annotated MDTuples are uniqued and handled by existing remapping facilities during cloning or inlining. Textual IR representation utilizes standard MDTuple syntax, requiring only a minor relaxation of the verifier to permit the specific {DILocation, {MDString, DILocation}} structure while still identifying genuinely invalid nodes.

  3. Transparent to existing passes. Passes that call getDebugLoc() / setDebugLoc() or use IRBuilder::SetCurrentDebugLocation() continue to work unmodified. The DebugLoc::get() accessor transparently unwraps the MDTuple to return the primary DILocation.

  4. Composable. The design supports an arbitrary number of intermediate locations per instruction (each as an additional operand in the MDTuple), and each intermediate location is tagged with a kind string (e.g., "TileIR", "MLIR") allowing multiple IR levels to coexist.

  5. Standard DWARF output. The secondary line table is emitted as a section in standard DWARF .debug_line format, meaning existing DWARF consumers (e.g. llvm-dwarfdump) require only the addition of the new section name, with no parser changes.

  6. Source embedding. Intermediate representation source code (which has no on-disk file) can be embedded directly in the output via !llvm.intermediate_level_source module-level metadata.

Might want to cross-reference this with @ZequanWu and this other RFC: [RFC] Multi-Sloc DWARF line table extension. The two probably have some overlap involving multiple source locations per instruction, whether they’re to encode ambiguity or multiple languages.

I have a somewhat similar, yet different, use case:

I am working on a database system which uses LLVM to compile SQL queries.

In our case, we have the following levels of “source code”:

  1. the user-provided SQL query
  2. the algebra tree (which has “operator ids” instead of line / column; I guess I would simply encode the “operator id” as line number :person_shrugging:)
  3. an internal intermediate representation
  4. LLVM IR

From what I can tell, your proposal would help us to keep track of all 3 types of debug locations - which would be great!

To which components are you planning to contribute this support? LLVM only? MLIR? LLDB? lldb-dap? llvm-objcopy --strip-debug?

Also, two more in-detail thoughts below…

Use DWARF instead of ELF to embed source files?

Intuitively, I would have expected the design here to be closely aligned with the existing -gembed-source (clang) and -Z embed-source (Rust). This embeds the source into a proprietary DWARF attribute. But thanks to https://dwarfstd.org/issues/180201.1.html, DWARFv6 should include first-class support for this.

Afaik, this brings benefits such as the already existing support for those embedded source files in LLDB. Furthermore, going down this route, multi-level line tables would be forward-compatible with other upcoming DWARF features such as source URLs.
Did you also consider DWARF-encoding as an alternative to ELF sections?

Besides DWARF vs ELF encoding, I wonder if also on the LLVM IR level, we could get inspiration from -gembed-source?

Make source file embedding optional?

I would see “embedded source code” as an orthogonal dimension to “multi-level line table”, and would like to be able to use multi-level line tables also with actual on-disk files.

I prefer to have the intermediate source files as actual files on disk, and not embedded inside the LLVM module / ELF object.
Having actual files on disk makes it simpler to integrate with other tooling (language servers, text search across multiple module dumps, etc.), and allows me to, e.g., set breakpoints in VS-Code by simply opening that temporary file, but on the downside it can quickly fill up the /tmp folder.

Hence, I think it would be great to decouple multi-level line tables from embedding, and make it usable also independently.

Sounds great.

GCC has actually supported similar functionality for quite some time, although it does not take the IR layering you mentioned into account. GCC introduced this mainly to better support inline scenarios.

such as:
0x00000000072206e0 3046 5 4 0 0 0 is_stmt
0x00000000072206e0 3046 5 4 0 0 0 is_stmt
0x00000000072206e0 3046 5 4 0 0 0 is_stmt
0x00000000072206e6 3046 5 4 0 0 0 is_stmt
0x00000000072206e6 3049 9 4 0 0 0 is_stmt

In previous work, we also found that when BOLT optimizes GCC-compiled programs and updates their debug information, a large amount of .debug_line information can be lost. The main reason is that LLVM currently does not support multiple source locations for a single instruction, which leads to this loss.

Regarding the current proposal for function-local metadata (FLMD), which refactors the in-memory representation of source locations into indexed storage vectors rather than using the LLVM Metadata model: there are major code collisions between that change and this one, since under those changes an Instruction will no longer contain a DILocation/MDNode pointer, and DILocation itself will become an MDNode-based wrapper for an FLMD reference.

I think it is quite possible to implement this within the FLMD model however. To summarize, the FLMD model represents a DebugLoc using a 32-bit “SrcLoc” index into a vector of {Line, Column, Scope}, a 16-bit atom group+rank field (used for key instructions), and a 16-bit InlinedAt index into a vector of inlined call info ({SrcLoc, InlinedAt, InlineeFn}). To add multi-level line table support, the only addition we need is for an Instruction to reference a tuple of SrcLocs rather than just one; the inlinedAt and atom fields need no changes, since we only track them for the primary location.

A simple implementation could be to reserve a “continuation bit” in the SrcLoc object, i.e. using a high 1 bit to indicate that the SrcLoc at the next index is part of the same tuple, and a high 0 bit to indicate a single location/end-of-tuple. The encoding of the particular IR layer could be shifted to the scope, e.g. adding an optional operand to DILexicalBlockFile that also encodes the IR Layer string. These changes add almost no cost to non-multi-level line tables, similar to the original approach, while enabling support for layered locations on a pay-as-you-go basis.
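
As a rough illustration of the continuation-bit idea, here is a standalone sketch. It is not taken from any FLMD patch: the SrcLoc layout, the choice of the top bit of the line field as the flag, and readTuple are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical SrcLoc record; the top bit of `line` is used as the
// continuation flag in this sketch.
struct SrcLoc {
  std::uint32_t line;
  std::uint16_t column;
  std::uint32_t scope;
};

constexpr std::uint32_t kContinuation = 1u << 31;

// Collects the tuple starting at index `first`: the primary location plus
// any following layers, stopping at the first entry with the flag clear.
std::vector<SrcLoc> readTuple(const std::vector<SrcLoc> &table,
                              std::size_t first) {
  std::vector<SrcLoc> out;
  for (std::size_t i = first; i < table.size(); ++i) {
    SrcLoc loc = table[i];
    bool more = (loc.line & kContinuation) != 0;
    loc.line &= ~kContinuation; // strip the flag before handing it out
    out.push_back(loc);
    if (!more)
      break;
  }
  return out;
}
```

An instruction without intermediate locations points at an entry whose flag is clear, so the loop reads exactly one element and the single-location path costs one extra bit test.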

A better implementation might take into account the practical characteristics of multi-level line tables, e.g. do most/all instructions have the same number of layers within a given function, and how strong is the correlation between the source locations across different layers? Even without optimizing for the expected case though, I expect that FLMD can support a working implementation of this proposal, and will improve performance over a Metadata-based implementation, though we can’t be certain until a practical FLMD implementation is available to measure.

I think there is some overlap, but fundamentally we are targeting two different use cases.
That proposal somewhat uses/abuses DWARF spec to preserve some optimized away information.
For this case I think we do need a completely separate section that tools can easily reason about. Right now my proposal goes with a completely separate section name that integrates the layer name into it, but follows the same ELF/DWARF spec as .debug_line. We could be cheeky about it and just name the section .debug_line, with the last file name acting as a special marker that it refers to some intermediate representation. That way, when all object files are linked, we end up with one .debug_line. For source-level sections things will just work, and tools can find the appropriate sections using DW_AT_stmt_list. Although if tools rely purely on .debug_line, this would break things, as all of a sudden we would have the same PCs in multiple sections.

  1. Components to contribute to
    a) LLVM yes
    b) MLIR if there is interest (I need to finish upstreaming https://github.com/llvm/llvm-project/pull/186146, as internal implementation relies on it).
    c) llvm-objcopy yes, should be minor change.
    d) LLDB/lldb-dap. Not sure how useful this proposal is for that as it’s purely line table, with no debug_info support. Primary usage model in mind was for profilers.
  2. Honestly I forgot about -gembed-source. I vaguely remembered there was a DWARF6 proposal, but didn’t want to rely on a spec that hasn’t been formalized yet. Regarding -gembed-source: as implemented it’s only for DWARF5. I guess we can separate it into 1) carrying the information through LLVM IR and 2) how it will look in the cubin. For 1), yeah, I think that should work, and frankly simplify things. The IR name can be carried in DIIntermediateScope as an additional field. For 2), we either need to extend that support to pre-DWARF5 line tables (which includes tool support), or emit as in the current proposal for DWARF4 and below, and the current -gembed-source way for DWARF5+.
  3. source file support
    There is nothing in the proposal that precludes emitting source files, vs embedding in the object/binary. DIFile by default is designed to point to a file on disk. Going with the IR representation of embed-source would make it trivial to distinguish between the two. If embedded source exists it goes into an ELF section, otherwise it is assumed the FE has written the IR out to a file. Something like that? The BE then just needs to decide what to do during emission. The original proposal was designed with a workflow in mind where the binary is profiled on a system where source code is not available, or in a JIT environment where there might not be access to the underlying file system.

Well BOLT re-writes .debug_line (from what I remember) so really you can do your own thing. :slight_smile:

From what I remember DILoc has an index into a SmallVector SrcLocs. That can be vector of vectors where SrcLocs[0] is primary one, and SrcLocs[i] is for whatever intermediate layer. With X upper bits in SrcLocIdx index into layer part.

But yeah, I think the proposal can be adapted to the FLMD representation.

What is your timeline for it? If this proposal is accepted by the community I was thinking through order of operations to implement both. :slight_smile: