Motivation
The AI compiler stack is growing increasingly complex, passing through multiple intermediate representations before reaching machine code. For instance, a modern AI kernel written in a high-level Python or C++ DSL (such as Triton or CuTile) typically traverses a deep pipeline: DSL Source → TileIR/MLIR → LLVM IR → … → Assembly. As these compilation stacks deepen, kernel writers and end users need insight into each layer to extract maximum performance and pinpoint bottlenecks. Consequently, there is an urgent need for profiling tools that can map machine instructions not just to high-level source code, but also back to specific intermediate representations. With this in mind, we propose modifying DebugLoc and the relevant parts of LLVM to carry additional line information without the associated full scope information.
Without this capability:
- Power users and compiler developers cannot correlate hot spots with the IR layer they are interested in. The profiler can show the DSL source line, but cannot map the bottleneck back to the corresponding construct in the intermediate representation, a gap that widens as DSL-to-GPU compilation stacks grow deeper.
Summary
We propose extending LLVM’s DebugLoc to carry multiple source locations per instruction: a primary location mapping to the high-level source, and one or more additional locations mapping to intermediate representations in the compilation pipeline. The extension modifies DebugLoc to allow it to reference an MDTuple instead of a single DILocation, making it transparent to passes that only care about the primary location while enabling profilers to build separate line tables for each IR level. A reference implementation targeting the NVPTX backend demonstrates the full pipeline: IR representation, bitcode round-trip, location merging, and inlining support.
Additionally, we introduce a mechanism for capturing intermediate source code in memory and embedding it directly within the final ELF file.
Reference implementation: [DRAFT][LLVM][DEBUG] Reference implementation of multi-level line table support by ayermolo · Pull Request #205453 · llvm/llvm-project · GitHub
Design
At a high level, the design extends DebugLoc to optionally carry additional intermediate locations alongside the primary one.
The core of this design involves widening the storage of DebugLoc from a specific DILocation* to a generic MDNode*. While a standard DebugLoc typically wraps a single DILocation, this transition enables the slot to hold either a bare location - preserving zero overhead for the common case - or an MDTuple that pairs the primary source mapping with one or more additional intermediate locations:
!dbg = !{ DILocation primary,
!{ MDString kind, DILocation intermediateLoc }, ; layer 1 (e.g. "MyMidIR")
!{ MDString kind, DILocation intermediateLoc }, ; layer 2 (optional)
... }
Concrete example:
!20 = !{!11, !17} ; composite location
!11 = !DILocation(line: 2, column: 5, scope: !8) ; primary: source.py:2:5
!17 = !{!"TileIR", !16} ; intermediate tag
!16 = !DILocation(line: 100, column: 10, scope: !15) ; intermediate: tileir:100:10
Operand 0 is always the primary DILocation; every later operand is an inner !{ kind, loc } sub-tuple, with the kind string naming the IR layer. The intermediate DILocation needs only a minimal scope, a DILexicalBlockFile (more on this below), enough to point at its DIFile and tie it to the compile unit, without copying the full source scope chain. DebugLoc hides the tuple from callers: get() returns the primary location, and separate accessors reach the intermediate ones. Existing passes keep working unchanged: merging combines the operands in parallel, equality compares both levels, bitcode round-trips the tuple, and the verifier checks its shape.
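The accessor behavior described above can be illustrated with a small toy model. This is only a sketch in Python, not LLVM's C++ API: `Loc`, `DebugLocModel`, and `intermediates()` are hypothetical names standing in for DILocation, the widened DebugLoc slot, and the separate intermediate accessors.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Loc:
    """Stand-in for a DILocation: a line, a column and a scope name."""
    line: int
    col: int
    scope: str


# A composite is (primary, (kind, loc), (kind, loc), ...), mirroring
# !{ primary, !{kind, loc}, ... } from the text.
Composite = Tuple[object, ...]


class DebugLocModel:
    """Hypothetical model of the widened DebugLoc slot (not LLVM's API)."""

    def __init__(self, node: Union[Loc, Composite, None]):
        self._node = node

    def get(self) -> Optional[Loc]:
        # Fast path: an empty or bare location is returned as-is.
        if self._node is None or isinstance(self._node, Loc):
            return self._node
        # Composite: operand 0 is always the primary location.
        return self._node[0]

    def intermediates(self) -> dict:
        # Empty or bare locations carry no intermediate layers.
        if self._node is None or isinstance(self._node, Loc):
            return {}
        return {kind: loc for kind, loc in self._node[1:]}


primary = Loc(2, 5, "source.py")
tile = Loc(100, 10, "tileir")
bare = DebugLocModel(primary)
composite = DebugLocModel((primary, ("TileIR", tile)))
```

In this model a caller that only uses `get()` sees the same primary location for `bare` and `composite`, which is the transparency property the design relies on; only callers that ask for `intermediates()` ever look inside the tuple.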
The intermediate IR text is stored separately, in a module-level named-metadata list whose entries have the form !{ kind, filename, source_text }; the DIFile in the intermediate location links an instruction to it. At emission, the backend writes the usual primary line directive followed by one secondary directive per intermediate tuple.
On the object side, the assembler lowers those secondary directives into a separate section in DWARF .debug_line format. The IR text goes into sections named by a hash of their contents, so the linker can fold identical bodies together via COMDAT, with the DWARF file descriptor pointing at them. Tools can then map a machine address to both the source line and the IR line, reading the IR text straight from the binary.
The framework is target-agnostic: the IR representation, verifier, and merging logic are backend-independent. Any target that emits DWARF line tables can opt in by emitting the secondary directives and the intermediate-source section, reusing the standard object and linker machinery.
Consumer impact. Because the additional line tables use the standard DWARF .debug_line format, existing tools need almost no changes. A consumer reuses its current .debug_line parser unchanged. It only has to recognize the new section names and follow the file entry to the embedded IR text. A profiler then maps an address to an IR-layer line the same way it already maps one to a source line, with no new parsing logic and no DIE-tree walking. The section layout and file-table referencing are covered in Final ELF Layout and Linking.
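As a rough illustration of the consumer side, the sketch below models each decoded line table as a sorted list of (address, line, column) rows and resolves one program counter against every table. It is a simplification made for this example: real .debug_line programs also carry file indices, sequence boundaries, and other registers, and the `lookup` and `resolve` helpers and the row values are hypothetical.

```python
import bisect


def lookup(rows, pc):
    """Return (line, col) of the last row whose address is <= pc, else None."""
    addrs = [row[0] for row in rows]
    i = bisect.bisect_right(addrs, pc) - 1
    return rows[i][1:] if i >= 0 else None


def resolve(tables, pc):
    """Resolve one address against every line table, keyed by section name."""
    return {section: lookup(rows, pc) for section, rows in tables.items()}


# Toy decoded tables; the section names follow the layout in this proposal.
tables = {
    ".debug_line": [(0x0, 2, 5), (0x10, 3, 1)],
    ".debug_line.tileir": [(0x0, 100, 10), (0x8, 101, 4), (0x10, 120, 2)],
}
hit = resolve(tables, 0xC)
```

Here `hit` maps `.debug_line` to `(2, 5)` and `.debug_line.tileir` to `(101, 4)`: the same lookup routine serves both tables, and only the section name differs.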
Verifier
At a minimum, the !dbg metadata check is relaxed to accept either a DILocation or an MDTuple whose first operand is a DILocation:
CheckDI(isa<DILocation>(N) ||
            (isa<MDTuple>(N) && N->getNumOperands() > 0 &&
             isa<DILocation>(N->getOperand(0))),
        "invalid !dbg metadata attachment", &I, N);
Intermediate Location Scope
Intermediate debug locations do not carry a full scope chain parallel to the source language's. They only need enough information to reference the intermediate source file, which exists only in memory, and tie it back to the compile unit. The reference implementation achieves this by using DILexicalBlockFile to store the virtual intermediate filename, with the parent scope pointing to the enclosing DISubprogram:
!15 = !DIFile(filename: "tileIR_source.123", directory: ".")
!16 = !DILexicalBlockFile(scope: !8, file: !15, discriminator: 0)
!17 = !DILocation(line: 100, column: 10, scope: !16)
Here !8 is the DISubprogram of the function being compiled. The DILexicalBlockFile provides the file association without introducing a new subprogram or lexical block hierarchy for the intermediate representation.
This works but is a pragmatic reuse of an existing node type whose primary purpose (discriminator-based scope splitting for the sample profiler) is unrelated. For clarity and to avoid overloading DILexicalBlockFile semantics, a dedicated class extending DILocalScope (e.g., DIIntermediateScope) could be introduced. Such a class would explicitly model the “virtual file reference within an existing subprogram” concept and make the intent self-documenting in both the IR and the verifier.
Location Merging
DebugLoc::getMergedLocation() is responsible for reconciling locations when instructions are fused during transformations like instruction combining, CSE, or hoisting. The design extends this logic to handle composite locations by merging each channel independently: the primary source mapping and any intermediate locations are reconciled in parallel using the existing merging heuristics.
The merging process follows these steps:
- Handle empty operands. If one input lacks location data, the other is preserved in its entirety, including all intermediate metadata. This ensures that merging with unlocated code does not strip away layer-specific information.
- Primary location reconciliation. Standard DILocation::getMergedLocation() rules apply to the first operand: identical sites are preserved, while differing positions within a shared scope collapse line/column data toward zero. Divergent function scopes result in a null location.
- Intermediate layer merging. Every intermediate location is processed using the same algorithm, isolated from the primary channel. If a specific layer is present in both inputs, they are merged; if only one side provides it, that mapping is carried forward.
- Tuple reconstruction. If any intermediate mapping remains valid after the merge, the result is re-wrapped into a !{primary, {kind, intermediate}} tuple. Otherwise, it returns a bare DILocation, maintaining zero-cost for standard compilation paths. These composite nodes are uniqued to prevent metadata bloat.
Crucially, these channels degrade independently. A merge might preserve an exact primary mapping while the intermediate IR location collapses (e.g., merging two instructions from the same source line but different IR constructs), or vice versa (e.g., the same IR instruction appearing in different inline contexts). Since the results are not coupled, the intermediate metadata must be explicitly merged rather than derived from the primary source mapping.
For multi-layer stacks, channels are paired based on their kind tag. Layers common to both inputs are merged pairwise, while unique layers are preserved. If a merge results in an empty location for a specific kind, that layer is dropped from the final composite tuple while others are retained.
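The merging steps can be sketched with a toy Python model. Everything here is an assumption made for illustration: `merge_loc` is a simplified stand-in for DILocation::getMergedLocation() (it ignores inlined-at chains and nested scopes), layers are emitted sorted by kind purely for determinism, and the case where the primary merges to null while a layer survives is not specified above, so the sketch simply keeps a `None` primary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Loc:
    line: int
    col: int
    scope: str  # stands in for the enclosing function scope


def merge_loc(a: Optional[Loc], b: Optional[Loc]) -> Optional[Loc]:
    """Simplified stand-in for DILocation::getMergedLocation()."""
    if a is None or b is None:
        return None
    if a == b:
        return a                      # identical sites are preserved
    if a.scope != b.scope:
        return None                   # divergent scopes: null location
    if a.line == b.line:
        return Loc(a.line, 0, a.scope)
    return Loc(0, 0, a.scope)         # differing positions collapse to zero


def split(node):
    """Return (primary, {kind: loc}) for a bare or composite location."""
    if isinstance(node, Loc):
        return node, {}
    return node[0], dict(node[1:])


def merge_debug_loc(a, b):
    # Step 1: an empty operand preserves the other side in its entirety.
    if a is None:
        return b
    if b is None:
        return a
    (pa, la), (pb, lb) = split(a), split(b)
    primary = merge_loc(pa, pb)       # Step 2: primary channel
    layers = []
    for kind in sorted(set(la) | set(lb)):   # Step 3: pair layers by kind
        if kind in la and kind in lb:
            merged = merge_loc(la[kind], lb[kind])
        else:
            merged = la[kind] if kind in la else lb[kind]
        if merged is not None:        # a layer that merges to empty is dropped
            layers.append((kind, merged))
    # Step 4: re-wrap only if some intermediate mapping survived.
    return (primary, *layers) if layers else primary


src = Loc(2, 5, "kernel")
a = (src, ("TileIR", Loc(100, 10, "tileir")))
b = (src, ("TileIR", Loc(140, 3, "tileir")))
merged = merge_debug_loc(a, b)
```

In this example the primary channel stays exact because both inputs share the same source position, while the TileIR channel collapses to line 0 within its scope, which shows the two channels degrading independently.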
Inlining
When a function is inlined, the primary DILocation keeps its own line and column but gains an inlined-at link back to the call site, preserving the source-level call stack. Callee instructions with no debug info are given the call site’s location (both primary and intermediate), attributing them to the call.
The intermediate location is carried through unchanged. It gets no inlined-at chain. An intermediate location only needs to identify a position in the IR layer, so it deliberately carries no inline scope; the source-level inline context is already recorded on the primary. As a result the intermediate line table doesn’t encode LLVM-level inlining: the same IR construct inlined at several sites maps to the same IR-layer location. Inline disambiguation lives in the primary line table, while the intermediate table just maps back to the IR text.
; Underlying nodes (before inlining):
src_callee = DILocation(line: 7, col: 3, scope: @callee) ; primary
mid_callee = DILocation(line: 42, col: 1, scope: MidIR_file) ; intermediate
src_call = DILocation(line: 20, col: 5, scope: @caller) ; the call site
; Callee body instruction, before inlining:
%x = add …, !dbg !{ src_callee, {"TileIR", mid_callee} }
; Same instruction, after inlining into the caller:
%x = add …, !dbg !{
DILocation(line: 7, col: 3, scope: @callee, inlinedAt: src_call), ; primary
{"TileIR", mid_callee} ; intermediate
}
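The same rewrite can be expressed as a toy Python model. This is a sketch under assumptions, not the inliner's actual code: `Loc`, `with_inlined_at`, and `inline_debug_loc` are hypothetical names, and a composite location is modeled as a tuple `(primary, (kind, loc), ...)`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Loc:
    line: int
    col: int
    scope: str
    inlined_at: Optional["Loc"] = None


def with_inlined_at(loc: Loc, site: Loc) -> Loc:
    """Append the call site to the end of the location's inlined-at chain."""
    if loc.inlined_at is None:
        return replace(loc, inlined_at=site)
    return replace(loc, inlined_at=with_inlined_at(loc.inlined_at, site))


def inline_debug_loc(callee_loc, call_site):
    """Rewrite one callee instruction's location when inlining at call_site."""
    if callee_loc is None:
        # No debug info: attribute the instruction to the call, both channels.
        return call_site
    site_primary = call_site if isinstance(call_site, Loc) else call_site[0]
    if isinstance(callee_loc, Loc):
        return with_inlined_at(callee_loc, site_primary)
    primary, *layers = callee_loc
    # Only the primary gains an inlined-at link; layers pass through unchanged.
    return (with_inlined_at(primary, site_primary), *layers)


src_callee = Loc(7, 3, "callee")
mid_callee = Loc(42, 1, "MidIR_file")
src_call = Loc(20, 5, "caller")
before = (src_callee, ("TileIR", mid_callee))
after = inline_debug_loc(before, src_call)
```

After the call, `after[0]` carries `inlined_at=src_call` while `after[1]` is still `("TileIR", mid_callee)`, so inlining the same callee at several sites yields distinct primary locations but one shared intermediate location.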
Bitcode Serialization
The intermediate location is serialized as a second FUNC_CODE_DEBUG_LOC record immediately following the primary one. The record carries an 8th field containing the MDString ID which the reader uses to reconstruct the !{kind, DILocation} tuple via appendIntermediateDebugLoc().
The ValueEnumerator is updated to enumerate both the intermediate DILocation operands and the kind MDString so they are available during bitcode emission.
Source Code Embedding
Intermediate IR source code—which does not exist as an on-disk file—is stored in !llvm.intermediate_level_source module-level named metadata:
!llvm.intermediate_level_source = !{!30, !31}
!30 = !{!"TileIR", !15, !"entry @add_kernel() { ... }"}
!31 = !{!"TileIR", !18, !"entry @other_kernel() { ... }"}
...
!15 = !DIFile(filename: "tileIR_source.123", directory: ".")
!16 = !DILexicalBlockFile(scope: !8, file: !15, discriminator: 0)
!17 = !DILocation(line: 100, column: 10, scope: !16)
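To show how a tool might follow an intermediate location to the embedded text, here is a minimal Python sketch. It models the named-metadata list as (kind, filename, source_text) triples and assumes 1-based line numbers; the helper names and the sample IR body are hypothetical and not taken from the reference implementation.

```python
def build_source_index(entries):
    """Index embedded sources by (kind, filename), split into lines."""
    return {(kind, filename): text.splitlines()
            for kind, filename, text in entries}


def ir_line(index, kind, filename, line):
    """Return the IR text at a 1-based line, or None if it cannot be found."""
    lines = index.get((kind, filename))
    if lines is None or not 1 <= line <= len(lines):
        return None
    return lines[line - 1]


# Hypothetical embedded body; real TileIR text will look different.
entries = [
    ("TileIR", "tileIR_source.123",
     "entry @add_kernel() {\n  %0 = load\n  %1 = add %0\n}"),
]
index = build_source_index(entries)
text = ir_line(index, "TileIR", "tileIR_source.123", 3)
```

With the sample above, line 3 resolves to the `%1 = add %0` line of the embedded body, and a lookup for an unknown file or an out-of-range line returns `None` rather than raising.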
Binary Representation and Line Tables
Both the primary source mapping and every intermediate IR layer use the standard DWARF .debug_line encoding. While the primary table correlates machine-code addresses to high-level source coordinates, each additional layer contributes a parallel .debug_line program within its own section, mapping those same addresses to intermediate positions. This reuse of established formats ensures composability: the linker can treat the new sections just like .debug_line. Consequently, existing DWARF consumers can parse these additional layers without modification, requiring only recognition of the new section names.
Every .debug_line program incorporates a file table in its header, indexed by the file register in each matrix row. In the primary table, these entries reference on-disk source files. For intermediate layers where source text exists only in memory, the file-table entries resolve to dedicated ELF sections that embed the IR text directly. These sections utilize a specific naming convention. For example:
.debug_txt.<IR_lang>.<hash>
Within this scheme, <IR_lang> identifies the specific layer (e.g., tileir, mlir) and <hash> represents the MD5 sum of the IR text. Consumers navigating the secondary line-number program follow the file descriptor to the corresponding .debug_txt section, allowing them to fetch the embedded IR body without consulting external files.
When multiple translation units embed identical IR bodies, they generate identically named text sections, each residing in a COMDAT group (SHT_GROUP) keyed by the hash signature. The linker retains a single instance per signature, collapsing redundant IR text into one copy in the final image.
A representative layout for a module with one intermediate layer follows:
.debug_line ; primary: PC -> source.py:line:col
.debug_line.tileir ; intermediate: PC -> tileir:line:col
.debug_txt.tileir.<hash> ; embedded TileIR body (deduplicated)
Under this architecture, a profiler can resolve a machine-code address to both the high-level source and any intermediate IR level. By consulting the primary and secondary line tables, the tool fetches the deduplicated IR text directly from the binary, providing a self-contained representation of the entire compilation stack.
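The naming and deduplication scheme can be sketched as follows. This is an illustration only: hex-encoding the MD5 digest and hashing the UTF-8 bytes of the text are assumptions (the exact encoding is not stated here), and `link_text_sections` merely imitates the "keep one copy per signature" effect of COMDAT folding rather than modeling a real linker.

```python
import hashlib


def text_section_name(ir_lang: str, ir_text: str) -> str:
    """Build a .debug_txt.<IR_lang>.<hash> name from the IR body."""
    # Assumption: the hash is the hex MD5 digest of the UTF-8 encoded text.
    digest = hashlib.md5(ir_text.encode("utf-8")).hexdigest()
    return f".debug_txt.{ir_lang}.{digest}"


def link_text_sections(translation_units):
    """Keep one body per section name, imitating COMDAT folding."""
    kept = {}
    for unit in translation_units:
        for ir_lang, ir_text in unit:
            kept.setdefault(text_section_name(ir_lang, ir_text), ir_text)
    return kept


add_body = "entry @add_kernel() { ... }"
other_body = "entry @other_kernel() { ... }"
tu1 = [("tileir", add_body)]
tu2 = [("tileir", add_body), ("tileir", other_body)]
linked = link_text_sections([tu1, tu2])
```

Although `add_body` appears in both translation units, `linked` ends up with two entries, one per distinct body, because identical text produces an identical section name.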
Impact on compiler performance
On the default path – where no intermediate location is attached – the DebugLoc stores a plain DILocation pointer, exactly as it does today. The only added cost is a single dyn_cast<DILocation> check in DebugLoc::get(), which succeeds on the first branch and short-circuits immediately.
The cost of the MDTuple representation is incurred only when an intermediate location is actually accessed, and only at the point where DebugLoc is dereferenced to extract the DILocation—not on every construction or assignment. In practice this means the feature is pay-as-you-go: compilation pipelines that do not attach intermediate locations see no measurable regression.
Metadata and IR memory footprint. In scenarios where intermediate mappings are utilized, the overhead is primarily confined to metadata expansion rather than computational complexity. Instructions referencing secondary locations swap a direct DILocation pointer for an MDTuple container—encompassing the {primary, {kind, intermediate}} structure and associated sub-tuples. Supporting entities, including the MDString tags, DIFile references, and DILexicalBlockFile scopes, are uniqued at the module level. This ensures they are allocated once per function and layer, effectively amortizing the cost across all instructions. Furthermore, instructions sharing an IR-layer coordinate reference the same uniqued DILocation node. The resulting marginal growth is limited to the wrapper tuples and a supplementary debug-location record in bitcode, while unannotated modules remain entirely unchanged.
Source Embedding. While the intermediate IR text represents the most significant size contribution when active, it is stored as uniqued MDStrings within a module-level named metadata list. This separation ensures that the source text is not processed by individual passes, preserving compiler throughput. The memory footprint is strictly bounded by the volume of unique IR code. On the binary side, this text is subject to COMDAT-based deduplication across translation units, as detailed in Final ELF Layout and Linking, ensuring that redundant IR bodies do not inflate the final object size.
Benefits
- Zero-cost when unused. If no intermediate location is attached, DebugLoc behaves identically to the current implementation; a plain DILocation pointer. The overhead of the MDTuple representation is incurred only when an intermediate location is present, and only at dereference time.
- Leverages established infrastructure. The implementation maintains the composite location within DebugLoc’s original storage as a raw, untracked MDNode*, avoiding the need for specialized tracking logic. This approach remains robust because annotated MDTuples are uniqued and handled by existing remapping facilities during cloning or inlining. Textual IR representation utilizes standard MDTuple syntax, requiring only a minor relaxation of the verifier to permit the specific {DILocation, {MDString, DILocation}} structure while still identifying genuinely invalid nodes.
- Transparent to existing passes. Passes that call getDebugLoc()/setDebugLoc() or use IRBuilder::SetCurrentDebugLocation() continue to work unmodified. The DebugLoc::get() accessor transparently unwraps the MDTuple to return the primary DILocation.
- Composable. The design supports an arbitrary number of intermediate locations per instruction (each as an additional operand in the MDTuple), and each intermediate location is tagged with a kind string (e.g., "TileIR", "MLIR") allowing multiple IR levels to coexist.
- Standard DWARF output. The secondary line table is emitted as a standard DWARF debug_line-format section, meaning existing DWARF consumers (e.g. llvm-dwarfdump) require only the addition of the new section name; no parser changes.
- Source embedding. Intermediate representation source code (which has no on-disk file) can be embedded directly in the output via !llvm.intermediate_level_source module-level metadata.