tl;dr we (Sony) propose to make DebugLoc a mandatory argument for creating instructions, and add new types of DebugLoc to document the reason for missing line numbers.
Background
Currently, we have rules for updating debug locations during optimizations[1], but the result of these rules can be opaque and there is no strict enforcement that they are followed. Because of this, we run the risk of decaying line table quality: building the CTMark projects in the llvm-test-suite repo with -O2 -g, we find that the number of unique source lines (excluding “line 0” entries) has decreased by ~3% since LLVM 10, with the worst case, bullet, seeing an ~8.3% reduction in unique source lines, and almost all projects seeing a gradual downwards trend in line table quality.
Although this could be resolved by having the debug info folk spend more developer time on fixing bad line locations, I believe this is a suboptimal solution: it is challenging to determine whether an instruction created by an optimization should have a line number and what that line number should be without deeper knowledge of the optimization or its implementation. From my own experience recently digging through missing line numbers, most cases are fairly simple (usually where a user has simply neglected to copy a line number to a new instruction that replaces an old), but there are also non-trivial cases: optimizations that rewrite blocks of instructions, such as reassociation or vectorization. For these optimizations there are not always easy answers for mapping the new instructions to the old, and the implementation may obfuscate that mapping to a developer unfamiliar with its details.
Proposal
There are two steps that I believe will improve the quality of debug lines:
1. Make DebugLoc a mandatory argument to all the instruction constructors.
2. Create a set of “special” DebugLocs to encode the cause of a missing debug line.
The argument for (1) is simple: the vast majority of instructions ought to be assigned some kind of DebugLoc when they are created, and requiring the DebugLoc to be assigned as a separate step after construction makes it easier to forget. This change also makes it easier to find the source of bugs where DebugLocs are dropped unnecessarily, since it is much easier to search for instructions created with an explicit empty DebugLoc than to search for instructions that do not have setDebugLoc called at some point after creation. There is one specific insertion pattern that this doesn’t play nicely with, which is creating an instruction without a DebugLoc and then using IRBuilder::Insert, which sets the DebugLoc on the inserted instruction; in these cases, either setting a temporary DebugLoc or using the IRBuilder to create the instruction will work.
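To illustrate the argument for (1), here is a minimal sketch using toy types. ToyDebugLoc, OldStyleInst, NewStyleInst and replaceWith are all hypothetical names invented for this example; they are not the real llvm::DebugLoc or llvm::Instruction APIs. The point is only that a constructor with a required location argument turns a forgotten location into a compile error, and makes an intentional drop visible (and searchable) at the creation site.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy stand-in for a debug location (hypothetical, not llvm::DebugLoc).
struct ToyDebugLoc {
  unsigned Line = 0;
  bool Empty = true;
  ToyDebugLoc() = default; // an explicit "no location"
  explicit ToyDebugLoc(unsigned L) : Line(L), Empty(false) {}
};

// Current pattern: the location is attached in a separate step that is
// easy to forget, and the omission leaves no trace in the source.
struct OldStyleInst {
  std::string Opcode;
  ToyDebugLoc DL;
  explicit OldStyleInst(std::string Op) : Opcode(std::move(Op)) {}
  void setDebugLoc(ToyDebugLoc L) { DL = L; }
};

// Proposed pattern: there is no constructor without a location, so
// leaving it out does not compile.
struct NewStyleInst {
  std::string Opcode;
  ToyDebugLoc DL;
  NewStyleInst(std::string Op, ToyDebugLoc L)
      : Opcode(std::move(Op)), DL(L) {}
};

// The common simple case: a new instruction that replaces an old one
// inherits the old instruction's location at construction time.
inline NewStyleInst replaceWith(const NewStyleInst &Old, std::string NewOp) {
  return NewStyleInst(std::move(NewOp), Old.DL);
}
```

With the second form, a deliberately dropped location has to be spelled out as `NewStyleInst("nop", ToyDebugLoc())`, which is straightforward to search for.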
The argument for (2) is that it will make it easier to detect real bugs and is a prerequisite for some further improvements discussed below. For every case where we drop debug line information, we currently have either an empty DebugLoc or a line 0 location - an indication that for some reason, there is no source line associated with that instruction. It would be easier to detect and clean up broken line information if we could distinguish the missing DebugLoc cases:
- Is this a purely compiler-generated instruction?
- Is this a merge of multiple instructions with different line numbers?
- Has this instruction been moved outside of its original control-flow?
- Does this instruction have an attribution that is either unclear or infeasible to determine right now?
- Is this instruction intended to be temporary, such that its line number doesn’t matter?
- Have we simply neglected to attach a non-empty DebugLoc?
By recording why a DebugLoc was dropped, tools such as Debugify can filter out cases where the developer has indicated that the missing location is intentional, vastly improving their ability to detect regressions. This is similar to work by Nikola Tesic to remove false positives from Debugify[2], albeit with a slightly different approach that gives a greater level of detail. It also gives us more information about why we’re losing debug lines, which may improve our ability to maintain or verify debug info for some of these cases:
- Purely compiler-generated instructions should be easy to verify and should be emitted with line 0.
- Out-of-control-flow instructions can be easily verified by a human, and potentially by a tool, as the rules for when they occur are well-defined. They can also still be attributed to their original source line, as long as it is clear to consumers that reaching the instruction does not imply the source line’s position in the original control flow of the source program has been reached (this is not possible in DWARF right now, but would be an easy extension to create).
- Instructions that don’t have clear attribution may be salvageable, either with changes to the pass implementation or with future improvements to our debug info representation.
- Temporary instructions can be searched for at the end of passes/compilation, such that any instruction with a DILocation marked as “temporary” is a clear error.
- Instructions that are simply missing debug locations (when the front-end is emitting debug info) are errors that should ideally be caught in review, but otherwise can be fixed-up later.
This also puts the expectation on developers to document their handling of debug line information through their choice of debug location.
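As a rough sketch of how a checker could use this extra information, the toy code below classifies locations by the reasons listed above and counts only the cases that indicate a real problem at the end of a pass. LocKind, AuditResult and auditLocations are hypothetical names for illustration and do not correspond to anything in Debugify or LLVM today; the split between "intentional" and "error" follows the bullets above and assumes the front-end is emitting debug info.

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification mirroring the questions listed above.
enum class LocKind {
  Normal,            // has a real source line
  CompilerGenerated, // purely compiler-generated instruction
  Merged,            // merge of instructions with different lines
  Misplaced,         // moved outside its original control flow
  Unknown,           // attribution unclear or infeasible right now
  Temporary,         // should not survive to the end of the pass
  Unset,             // nobody attached a non-empty location
};

struct AuditResult {
  unsigned Intentional = 0; // documented drops, not reported
  unsigned Errors = 0;      // likely bugs, reported
};

// Meant to run at the end of a pass: documented drops are filtered out,
// while leftover temporaries and unset locations are flagged.
inline AuditResult auditLocations(const std::vector<LocKind> &Locs) {
  AuditResult R;
  for (LocKind K : Locs) {
    switch (K) {
    case LocKind::Normal:
      break;
    case LocKind::CompilerGenerated:
    case LocKind::Merged:
    case LocKind::Misplaced:
    case LocKind::Unknown:
      ++R.Intentional;
      break;
    case LocKind::Temporary:
    case LocKind::Unset:
      ++R.Errors;
      break;
    }
  }
  return R;
}
```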
Implementation
Adding this information to every site where instructions are created is a lengthy task. In terms of process, the first step should be to make DebugLoc an optional argument to all instruction-creation methods, defaulting to an empty debug location. This makes it possible to work gradually, by updating one or more instruction types at a time until DebugLoc is mandatory for all instructions.
The special DebugLocs themselves can be simple: there needs to be an extra field added to DILocation to indicate its type, either a simple enum, or a bitmask if we want to represent combinations of these. These special DebugLocs can continue to use line 0 as usual, and creating them with nullptr arguments should yield an empty DebugLoc, so that use of these calls does not need to be conditional on debug info being enabled.
// New static methods to create DILocations.
static DILocation *DILocation::createCompilerGenerated(DISubprogram *SP);
static DILocation *DILocation::createMerged(DILocation *A, DILocation *B);
static DILocation *DILocation::createMisplaced(DILocation *Orig);
static DILocation *DILocation::createTemporary(DISubprogram *SP);
static DILocation *DILocation::createUnknown(DISubprogram *SP);
// Empty DebugLocs should only be created by front-ends emitting instructions
// without line information.
DebugLoc::DebugLoc();
// Also add convenience update methods.
void DILocation::setMisplaced();
void Instruction::setMisplacedDebugLoc();
void DILocation::mergeWith(DILocation *Other);
void Instruction::mergeDebugLocWith(DILocation *Other);
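The extra field and the nullptr behaviour described above might look roughly like the following toy model. Everything here (LocFlags, ToyLocation, ToySubprogram, and the free functions) is a hypothetical stand-in, not the real DILocation; it uses the bitmask option so that kinds can be combined, and it assumes that a misplaced location keeps its original line, which is a choice made for this sketch rather than something the proposal fixes.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical bitmask for the "kind" field; a bitmask (rather than a
// plain enum) lets one location be, e.g., both merged and misplaced.
enum LocFlags : std::uint8_t {
  LF_None = 0,
  LF_CompilerGenerated = 1 << 0,
  LF_Merged = 1 << 1,
  LF_Misplaced = 1 << 2,
  LF_Temporary = 1 << 3,
  LF_Unknown = 1 << 4,
};

struct ToySubprogram {}; // stand-in for a function scope

struct ToyLocation {
  unsigned Line;
  const ToySubprogram *Scope;
  std::uint8_t Flags;
  ToyLocation(unsigned L, const ToySubprogram *S, std::uint8_t F)
      : Line(L), Scope(S), Flags(F) {}
};

using ToyLocPtr = std::shared_ptr<ToyLocation>;

// A null scope (debug info disabled) yields an empty location, so
// callers do not need to check whether debug info is enabled.
inline ToyLocPtr createCompilerGenerated(const ToySubprogram *SP) {
  if (!SP)
    return nullptr;
  return std::make_shared<ToyLocation>(0u, SP, LF_CompilerGenerated);
}

// Assumption for this sketch: the original line and flags are kept and
// the misplaced bit is added on top.
inline ToyLocPtr createMisplaced(const ToyLocPtr &Orig) {
  if (!Orig)
    return nullptr;
  return std::make_shared<ToyLocation>(
      Orig->Line, Orig->Scope,
      static_cast<std::uint8_t>(Orig->Flags | LF_Misplaced));
}
```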
Future work
Besides improving the quality of line information by making debug info easier to maintain and harder to accidentally break, there are some features that we’re considering implementing in LLVM that will build on or benefit from this new approach.
Misplaced instructions
One problem with the current representation of debug lines is that there are two clashing concerns that cannot be resolved: for the purposes of profiling and some debugging use cases, we want to drop the debug line information for an instruction that has been moved out of its original control-flow, such as with speculatively executed instructions. However, for the purposes of crash dumps or in some cases while debugging we would still like to know which source instruction was responsible for the generated instruction.
Right now there is no way to represent this in LLVM or DWARF, but it would be relatively simple (on a technical level) to add an extra bit to the line table to indicate that an instruction has been “misplaced”, meaning that it cannot be used to infer control-flow (so will be ignored by profilers) but can still be used to get a source location for a crash dump. Implementing this in LLVM requires us to identify source locations that have been moved outside of their original control-flow, which is trivially achieved by the proposal above.
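To make the two consumers concrete, here is a small sketch of a line table with such a bit. ToyLineRow, lineForProfile and lineForCrash are hypothetical; no such flag exists in DWARF or LLVM today, and a real line table maps address ranges rather than exact addresses. A return value of 0 stands for "no source line", in keeping with the line 0 convention.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy line table row with the proposed extra bit (hypothetical).
struct ToyLineRow {
  std::uint64_t Address;
  unsigned Line;
  bool Misplaced; // reaching this address does not imply reaching Line
};

inline const ToyLineRow *findRow(const std::vector<ToyLineRow> &Table,
                                 std::uint64_t Addr) {
  for (const ToyLineRow &R : Table)
    if (R.Address == Addr)
      return &R;
  return nullptr;
}

// A profiler must not credit a line for a misplaced instruction, since
// executing it says nothing about the source-level control flow.
inline unsigned lineForProfile(const std::vector<ToyLineRow> &Table,
                               std::uint64_t Addr) {
  const ToyLineRow *R = findRow(Table, Addr);
  return (R && !R->Misplaced) ? R->Line : 0;
}

// A crash dump only needs to know which source line produced the
// instruction, so the misplaced bit is ignored.
inline unsigned lineForCrash(const std::vector<ToyLineRow> &Table,
                             std::uint64_t Addr) {
  const ToyLineRow *R = findRow(Table, Addr);
  return R ? R->Line : 0;
}
```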
Key instructions
The debug experience of stepping behaviour in optimized code can be quite poor; as instructions are merged, split, hoisted and sunk, the original control flow of the program becomes mangled, and in particular stepping often goes back-and-forth between the same lines. The reason for this is that a given source line may have generated many instructions, which have been spread to different places in the program, and we currently will simply step on all of them. We are investigating the use of Key Instructions (Caroline Tice & Prof. Susan L. Graham, 2000)[3] to resolve this issue, by identifying which instructions are responsible for source-level state changes (variable assignments, branches, function calls) and using those “key instructions” for stepping. Besides requiring a better handling of debug lines to ensure we don’t incorrectly drop key instructions, implementing this in LLVM will require new DebugLoc update rules that can be implemented as an extension to this API.
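The back-and-forth stepping problem and the intended effect of key instructions can be shown with a heavily simplified model. ToyInst, stopsByLineChange and stopsByKeyInstruction are hypothetical, and the "stop whenever the line changes" rule is only an approximation of how a debugger steps; this is a sketch of the idea, not of the paper's algorithm or of any planned LLVM implementation.

```cpp
#include <cassert>
#include <vector>

// One executed instruction: its source line and whether it is the
// instruction that performs the line's source-level state change.
struct ToyInst {
  unsigned Line;
  bool IsKey;
};

// Approximation of current behaviour: stop each time the line differs
// from the previous stop (line 0 entries are skipped).
inline std::vector<unsigned>
stopsByLineChange(const std::vector<ToyInst> &Trace) {
  std::vector<unsigned> Stops;
  unsigned Prev = 0;
  for (const ToyInst &I : Trace) {
    if (I.Line != 0 && I.Line != Prev) {
      Stops.push_back(I.Line);
      Prev = I.Line;
    }
  }
  return Stops;
}

// With key instructions: stop only on instructions marked as key.
inline std::vector<unsigned>
stopsByKeyInstruction(const std::vector<ToyInst> &Trace) {
  std::vector<unsigned> Stops;
  for (const ToyInst &I : Trace)
    if (I.IsKey && I.Line != 0)
      Stops.push_back(I.Line);
  return Stops;
}
```

For a trace where the instructions of lines 3 and 4 have been interleaved, such as `{3, false}, {4, false}, {3, true}, {4, true}`, the first function stops four times (3, 4, 3, 4) while the second stops once per line (3, 4).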
[1] How to Update Debug Info: A Guide for LLVM Pass Authors — LLVM 19.0.0git documentation
[2] [RFC][Debugify] False positives elimination - #3 by ntesic
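[3] Caroline Tice and Susan L. Graham, Key Instructions: Solving the Code Location Problem for Optimized Code (2000), https://www.researchgate.net/publication/2432347_Key_Instructions_Solving_the_Code_Location_Problem_for_Optimized_Code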