[RFC] Proposed update to handling debug locations in LLVM

tl;dr we (Sony) propose to make DebugLoc a mandatory argument for creating instructions, and add new types of DebugLoc to document the reason for missing line numbers.

Background

Currently, we have rules for updating debug locations during optimizations[1], but the result of these rules can be opaque and there is no strict enforcement that they are followed. Because of this, we run the risk of decaying line table quality: building the CTMark projects in the llvm-test-suite repo with -O2 -g, we find that the number of unique source lines (excluding “line 0” entries) has decreased by ~3% since LLVM 10, with the worst case bullet seeing an ~8.3% reduction in unique source lines, and almost all projects seeing a gradual downwards trend in line table quality.

Although this could be resolved by having the debug info folk spend more developer time on fixing bad line locations, I believe this is a suboptimal solution: it is challenging to determine whether an instruction created by an optimization should have a line number and what that line number should be without deeper knowledge of the optimization or its implementation. From my own experience recently digging through missing line numbers, most cases are fairly simple (usually where a user has simply neglected to copy a line number to a new instruction that replaces an old), but there are also non-trivial cases: optimizations that rewrite blocks of instructions, such as reassociation or vectorization. For these optimizations there are not always easy answers for mapping the new instructions to the old, and the implementation may obfuscate that mapping to a developer unfamiliar with its details.

Proposal

There are 2 steps that I believe will improve the quality of debug lines:

  1. Make DebugLoc a mandatory argument to all the instruction constructors.
  2. Create a set of “special” DebugLocs to encode the cause of a missing debug line.

The argument for (1) is simple: the vast majority of instructions ought to be assigned some kind of DebugLoc when they are created, and requiring the DebugLoc to be assigned as a separate step after construction makes it easier to forget. This change also makes it easier to find the source of bugs where DebugLocs are dropped unnecessarily, since it is much easier to search for instructions created with an explicit empty DebugLoc than to search for instructions that do not have setDebugLoc called at some point after creation. There is one specific insertion pattern that this doesn’t play nicely with, which is creating an instruction without a DebugLoc and then using IRBuilder::Insert which sets the DebugLoc on the inserted instruction; in these cases, either setting a temporary DebugLoc or using the IRBuilder to create the instruction will work.

The argument for (2) is that it will make it easier to detect real bugs and is a prerequisite for some further improvements discussed below. For every case where we drop debug line information, we currently have either an empty DebugLoc or a line 0 location - an indication that for some reason, there is no source line associated with that instruction. It would be easier to detect and clean up broken line information if we could distinguish the missing DebugLoc cases:

  • Is this a purely compiler-generated instruction?
  • Is this a merge of multiple instructions with different line numbers?
  • Has this instruction been moved outside of its original control-flow?
  • Does this instruction have an attribution that is either unclear or infeasible to determine right now?
  • Is this instruction intended to be temporary, such that its line number doesn’t matter?
  • Have we simply neglected to attach a non-empty DebugLoc?

By including why a DebugLoc was dropped, tools such as Debugify can filter out cases where the developer has hinted the missing location is intentional, vastly improving its ability to detect regressions. This is similar to work by Nikola Tesic to remove false positives from debugify[2], albeit with a slightly different approach that gives a greater level of detail. It also gives us more information about why we’re losing debug lines, which may improve our ability to maintain or verify debug info for some of these cases:

  • Purely compiler-generated instructions should be easy to verify and should be emitted with line 0.
  • Out-of-control-flow instructions can be easily verified by a human, and potentially by a tool, as the rules for when they occur are well-defined. They can also still be attributed to their original source line, as long as it is clear to consumers that reaching the instruction does not imply the source line’s position in the original control flow of the source program has been reached (this is not possible in DWARF right now, but would be an easy extension to create).
  • Instructions that don’t have clear attribution may be salvageable, either with changes to the pass implementation or with future improvements to our debug info representation.
  • Temporary instructions can be searched for at the end of passes/compilation, such that any instruction with a DILocation marked as “temporary” is a clear error.
  • Instructions that are simply missing debug locations (when the front-end is emitting debug info) are errors that should ideally be caught in review, but otherwise can be fixed-up later.

This also puts the expectation on developers to document their handling of debug line information through their choice of debug location.

Implementation

Adding this information to every site where instructions are created is a lengthy task. In terms of process, the first step should be to make DebugLoc an optional argument to all instruction-creation methods, defaulting to an empty debug location. This makes it possible to work gradually, by updating one or more instruction types at a time until DebugLoc is mandatory for all instructions.

The special DebugLocs themselves can be simple: there needs to be an extra field added to DILocation to indicate its type, either a simple enum, or a bitmask if we want to represent combinations of these. These special DebugLocs can continue to use line 0 as usual, and creating them with nullptr arguments should yield an empty DebugLoc, so that use of these calls does not need to be conditional on debug info being enabled.

// New static methods to create DILocations.
static DILocation *DILocation::createCompilerGenerated(DISubprogram *SP);
static DILocation *DILocation::createMerged(DILocation *A, DILocation *B);
static DILocation *DILocation::createMisplaced(DILocation *Orig);
static DILocation *DILocation::createTemporary(DISubprogram *SP);
static DILocation *DILocation::createUnknown(DISubprogram *SP);

// Empty DebugLocs should only be created by front-ends emitting instructions
// without line information.
DebugLoc::DebugLoc();

// Also add convenience update methods.
void DILocation::setMisplaced();
void Instruction::setMisplacedDebugLoc();
void DILocation::mergeWith(DILocation *Other);
void Instruction::mergeDebugLocWith(DILocation *Other);

Future work

Besides improving the quality of line information by making debug info easier to maintain and harder to accidentally break, there are some features that we’re considering implementing in LLVM that will build on or benefit from this new approach.

Misplaced instructions

One problem with the current representation of debug lines is that there are two clashing concerns that cannot be resolved: for the purposes of profiling and some debugging use cases, we want to drop the debug line information for an instruction that has been moved out of its original control-flow, such as with speculatively executed instructions. However, for the purposes of crash dumps or in some cases while debugging we would still like to know which source instruction was responsible for the generated instruction.

Right now there is no way to represent this in LLVM or DWARF, but it would be relatively simple (on a technical level) to add an extra bit to the line table to indicate that an instruction has been “misplaced”, meaning that it cannot be used to infer control-flow (so will be ignored by profilers) but can still be used to get a source location for a crash dump. Implementing this in LLVM requires us to identify source locations that have been moved outside of their original control-flow, which is trivially achieved by the proposal above.

Key instructions

The debug experience of stepping behaviour in optimized code can be quite poor; as instructions are merged, split, hoisted and sunk, the original control flow of the program becomes mangled, and in particular stepping often goes back-and-forth between the same lines. The reason for this is that a given source line may have generated many instructions, which have been spread to different places in the program, and we currently will simply step on all of them. We are investigating the use of Key Instructions (Caroline Tice & Prof. Susan L. Graham, 2000)[3] to resolve this issue, by identifying which instructions are responsible for source-level state changes (variable assignments, branches, function calls) and using those “key instructions” for stepping. Besides requiring a better handling of debug lines to ensure we don’t incorrectly drop key instructions, implementing this in LLVM will require new DebugLoc update rules that can be implemented as an extension to this API.

[1] How to Update Debug Info: A Guide for LLVM Pass Authors — LLVM 19.0.0git documentation
[2] [RFC][Debugify] False positives elimination - #3 by ntesic
[3] https://www.researchgate.net/publication/2432347_Key_Instructions_Solving_the_Code_Location_Problem_for_Optimized_Code

5 Likes

Thanks for all the work/thoughts here.

I’m generally in favour of the direction. We can haggle over the memory usage concerns of changing DebugLocs in the relevant reviews, maybe there are more compact encodings we can use (like uint32 (or uint64, whatever we’re using) max for the line number (and then -1, -2, etc for the different bits we want to communicate) but that doesn’t make it easy to carry the “misplaced” /and/ the original line location at the same time, etc)

I’d like to maybe suggest a minor rephrasing to some of the motivation…

One problem with the current representation of debug lines is that there are two clashing concerns that cannot be resolved: for the purposes of profiling and some debugging use cases, we want to drop the debug line information for an instruction that has been moved out of its original control-flow, such as with speculatively executed instructions. However, for the purposes of crash dumps or in some cases while debugging we would still like to know which source instruction was responsible for the generated instruction.

I don’t think these things are in conflict, nor is dropping what we “want” to do. In all cases (I think even in the sanitizer cases & maybe crash dumps) we don’t want to tell someone this codepath executed when it didn’t.

And then, as you say, given we don’t have a way to communicate this through DWARF ("here’s the line that had the load, etc, but it isn’t necessarily reached… ") we use line zero/no line. But if we had a way to encode this property, we’d be happy to provide the information for all clients - some can then ignore these locations entirely (I’d expect simple sample based profilers to ignore it entirely) but any other client would at least be recommended to render these locations only if they could also communicate this “controlling conditions that lead to this line may not be satisfied” or some other way to communicate this weid thing we’re trying to warn people of.

1 Like

That rephrasing sounds reasonable to me - the spirit of the matter is that we cannot provide attribution for some instructions without also implying false information about the program’s control flow. It’s not that these concerns are fundamentally in conflict, but that we just can’t express this fact, so we’re forced to abandon one of these concerns.

I’m not a fan of this. For example, in InstCombine we commonly return a freshly created instruction, which then gets inserted and debug location assigned by generic infrastructure. This pattern would no longer work (or regress in ergonomics) if you have to pass in a DebugLoc already during construction, instead of setting it afterwards.

1 Like

Yes, this is the pattern that is least well-served by this proposal. I hadn’t come to a general solution to all cases (this mostly happens in front-ends with the IRBuilder, where using DebugLoc() for instructions before insertion is most appropriate) - for cases like InstCombine, one option would be to have a “temporary” DebugLoc as a member inside whichever class is creating the new instructions, and using that location for all new instructions by default. This should reduce the ergonomic cost of the mandatory DebugLoc argument as long as that member is available, while also ensuring that if one of the created instructions does not have a valid DebugLoc set on it afterwards then we can detect it immediately as an error. Does that sound like a reasonable solution to you?

For example, in InstCombinerImpl we’d add a field DebugLoc TempDebugLoc that is set to DILocation::createTemporary(F.getSubprogram()) in the constructor. Then for every instance where InstCombinerImpl creates an instruction, we use TempDebugLoc as the source location, which means adding an extra argument to each call but requires no further changes.

2 Likes

Not really. To me, this indicates that requiring the DebugLoc during construction is not really the correct modelling for the domain. (Though requiring the DebugLoc during insertion instead also doesn’t work, for other cases.)

If you go down that route, I’d probably require use of IRBuilder for everything in InstCombine, and remove support for the “return instruction to insert” pattern entirely – it seems like at that point it may be more hassle than it’s worth.

As a general alternative, rather than enforcing this in C++, would it make sense to have a verifier check instead? That basically says that if the function has a DISubprogram attachment, then all the instructions in it must also have a !dbg location? I believe we already have this check for calls, so this may be a natural extension. (Though TBH I have only ever encountered false-positives from that check, which required explicit supression with 0 locs. It never provided useful signal.)

1 Like

Not really. To me, this indicates that requiring the DebugLoc during construction is not really the correct modelling for the domain. (Though requiring the DebugLoc during insertion instead also doesn’t work, for other cases.)
…
As a general alternative, rather than enforcing this in C++, would it make sense to have a verifier check instead?

Hm, that’s a reasonable insight. The concept of the temporary in my mind is a promise that even if we can’t supply a correct DebugLoc right now, we’re confident that the instruction will either get one later or will not be emitted in the final object. What we really what to represent is that no instruction that doesn’t have a valid origin for its DebugLoc should appear in a function. This could be achieved by having the argument be an optional (default DebugLoc()) and using just a verifier check afterwards, with some minor adjustments: most notably, this would mean attaching DebugLocs to any instructions that are inlined from a function without debug info, but that might be necessary regardless.

I think there’s a good argument in favour of compiler enforcement: we do not get a huge amount of test coverage of debug builds on the buildbots - we mostly rely on the lit tests, unit tests, and stage 2 clang builds. Combined with the ease of forgetting about debug info when updating passes, I think it would be quite easy to miss errors introduced by new commits if we only used verifier checks. Compiler enforcement is not total protection since we could still have manually-added temporary/empty DebugLocs that persist further than intended and aren’t caught by the verifier, but this at least forces developers and reviewers to make a judgement call before merging rather than simply forgetting. Also, it’s easier to incrementally update the codebase with compiler checks; we can perform fixes one instruction-type at a time, rather than needing to fix everything before turning the verifier checks on.

I also take your point that using temporary DebugLocs in InstCombine and other locations demonstrates that construction isn’t necessarily the right place for this; I think the most accurate representation of what we intend would be a Builder pattern for Instructions (not the IRBuilder), which would allow us to either set DebugLocs immediately or defer until insertion, but could still enforce (potentially at compile time) that we are setting a DebugLoc at some point. Creating this sounds like it could be even more hassle than using the IRBuilder for everything, but it would be the more correct option imo; in practice, if there’s strong objection to requiring DebugLocs in the constructor then the next best approach may be to just perform these checks in the verifier.

Hi, coming back to this after a while - I’ve been looking into this and have an alternative proposal. I agree that making DebugLocs a mandatory instruction argument may not be the right model, at least at the moment. Using the verifier has difficulties as well; one being that we could only enable these checks on main when they would pass, which could take some time to achieve; another being that the verifier can only produce results at the end of a pass and doesn’t otherwise help narrow down the origin of a problem, unlike the compiler which gives us the exact source location of an error.

My alternative proposal is to start off using debugify. Using debugify-each in the “original DI preservation” mode can give us a detailed report of DebugLoc errors; currently it also picks up cases where we drop DebugLocs intentionally. By adding extra information to DebugLocs, similar to the original proposal, and tracking this info in Debugify, prototyped here, we can remove false positives from debugify and identify the location of bugs in LLVM. Eventually we should be able to move some of these checks into the verifier to guard against regressions, and pursue the followup extensions from the original proposal.

1 Like

Could debugify be used in a more fuzzer/regression kind of way - like, if you’re going to stand up a buildbot or other continuous infrastructure, then having a way to flag regressions in debug info tracking would be quite valuable/actionable, compared to flagging existing problems (useful, but a separate channel - for people working on that backlog, rather than for the random developer that added a new transformation, etc)

1 Like

Yes, that’s hopefully where we’ll end up. I think we first need to eliminate as many existing bugs as possible (which in some cases may just mean marking them “Unknown” as a fixme), otherwise it would be hard to prevent this regression suite from flagging new patches as being responsible for existing errors. I don’t have any immediate plans to set up such a suite, but I’ve been using the LLVM test suite with debugify to investigate existing issues; I think that would be a reasonable starting point.

As a followup, the set of patches that implement my more recent proposal are being opened for review now - the first of which can be found here. The aim of these patches is that, once landed, we could begin to use debugify to detect debug line regressions in CI, as suggested by @dblaikie.

Such a suite would only be useful if we can be assured that any new bugs detected are actually new, which means as a matter of practicality it should only run on code that has no existing bugs reported, as even if we tried to avoid flagging bugs that existed in prior builds, it is too easy for innocuous changes to cause existing DebugLoc bugs to manifest in different places or appear slightly different.

NB: this doesn’t mean the output must be perfect - the DebugLoc::getUnknown() value can be used as a “FIXME” to annotate cases where we haven’t solved the problem yet, but we don’t need it reported by debugify.

1 Like