[RFC] Improving is_stmt placement for better interactive debugging

Stepping through optimized programs is often a trial, and one significant factor is jumping back and forth between source locations even in linear code. We (Sony) have prototyped a solution that reduces stepping jumpiness by making LLVM’s is_stmt placement smarter.

I’ll discuss this project in my upcoming LLVM talk “Improving optimized code line table quality”.

Quick summary

A lot of the noise in stepping comes from code motion and instruction scheduling. This is reasonable; a long expression on a single line may involve multiple operations that are moved, re-ordered, and interleaved with other instructions by optimisations. But we can eliminate a lot of that noise.

LLVM chooses whether to apply is_stmt (DWARF flag to indicate recommended breakpoint locations in the line table) using simple heuristics that are essentially just “has the line number changed since the last instruction?”.

Taking ideas from two papers [1][2] that explore the issue, especially @cmtice’s, we frame the problem as follows.

From the perspective of a source-level debugger user:

  • Source code is made up of interesting constructs; the level of granularity for “interesting” while stepping is typically assignments, calls, control flow. We’ll call these interesting constructs Atoms.
  • Atoms usually have one instruction that implements the functionality that a user can observe; once they step “off” that instruction, the atom is finalised. We’ll call that a Key Instruction.
  • Communicating where the key instructions are to the debugger (using DWARF’s is_stmt) avoids jumpiness introduced by scheduling non-key instructions without losing source attribution (non-key instructions retain an associated source location, they’re just ignored for stepping).

Our prototype looks like this:

  • 2 new fields added to DILocation metadata.
  • The new fields are assigned values in a pre-optimisation pass that chooses “interesting” instructions (linear scan over instructions).
  • There’s some middle end upkeep.
    • There’s a map and remap function pair that need to be called for instructions cloned when control flow is duplicated, e.g. during loop unroll. CloneBasicBlock maps the metadata and RemapInstruction(s) remaps it, so a bunch of transformations don’t actually need to do anything new in practice.
    • Other than that, the new metadata is propagated organically being a part of DILocation.
  • During DWARF emission, the new metadata is collected (linear scan over instructions) to decide is_stmt placements.

Compile-time-tracker reports the following instructions:u overhead on CTMark compile times:

stage1-ReleaseLTO-g

| Benchmark | Old | New |
| --- | --- | --- |
| kimwitu++ | 59076M | 59540M (+0.79%) |
| sqlite3 | 61163M | 61616M (+0.74%) |
| consumer-typeset | 54620M | 55142M (+0.96%) |
| Bullet | 111784M | 112269M (+0.43%) |
| tramp3d-v4 | 176555M | 177689M (+0.64%) |
| mafft | 38854M | 39137M (+0.73%) |
| ClamAV | 85975M | 86536M (+0.65%) |
| lencod | 118636M | 119331M (+0.59%) |
| SPASS | 77932M | 78433M (+0.64%) |
| 7zip | 272642M | 273848M (+0.44%) |
| geomean | 89451M | 90042M (+0.66%) |

stage1-O0-g

| Benchmark | Old | New |
| --- | --- | --- |
| kimwitu++ | 24932M | 25167M (+0.94%) |
| sqlite3 | 4823M | 4938M (+2.39%) |
| consumer-typeset | 12635M | 13114M (+3.79%) |
| Bullet | 65029M | 65445M (+0.64%) |
| tramp3d-v4 | 21231M | 21477M (+1.16%) |
| mafft | 6772M | 6921M (+2.20%) |
| ClamAV | 13722M | 14000M (+2.02%) |
| lencod | 12818M | 13025M (+1.61%) |
| SPASS | 14455M | 14647M (+1.33%) |
| 7zip | 142937M | 143449M (+0.36%) |
| geomean | 18676M | 18982M (+1.64%) |

The cost of enabling the feature for builds with optimisations is quite reasonable given the improvements to the debugging experience, and there may be scope to reduce it further.

There’s a higher percentage hit for O0 builds, but I still think it’s worth enabling at O0. The feature modifies O0 debugging behaviour; if we enabled it with optimisations but not at O0 we would be providing a different baseline (O0) debugging behaviour. For example, a multi-line assignment might involve several back-and-forth jumps even at O0 without Key Instructions and is less likely to with Key Instructions.

(If you don’t care about implementation details or open questions, skip to the conclusion now)

Details

(Read the quick summary first)

Atoms having “one key instruction” is a simplification; this often isn’t true. Consider the following murky cases:

  • Loop unrolling, in which an assignment’s Key Instruction store gets duplicated.
  • Conditional branches in IR getting lowered to cond-br + br pairs.
  • A front end may emit multiple stores for one variable assignment. Each is a candidate Key Instruction - they may get re-ordered by optimisations, so the front end can’t tell which it’ll be ahead of time. (In this case there is only one Key Instruction, but it’s not known until after optimisation.)

Fear not, these cases can still all be handled (and are handled by our prototype).

Prototype implementation

The metadata - The two new DILocation fields are atomGroup and atomRank. Both are unsigned integers. Instructions in the same function with the same (atomGroup, InlinedAt) pair are part of the same source atom. atomRank determines is_stmt preference within that group, where a lower number is higher precedence. The default atomRank of 0 indicates the instruction isn’t interesting.

Pre-optimisation - A key-instructions pass sets the new DILocation fields on instructions that it deems important for stepping. The following instructions get atomGroup numbers: ReturnInst, SwitchInst, conditional BranchInst, MemIntrinsic, StoreInst. Calls are ignored here - they’re unconditionally marked is_stmt later.

For MemIntrinsic and StoreInst we additionally add the instruction computing the stored value, if there is one, to the same atom (we give it the same atomGroup number) with a higher atomRank (lower precedence). We do the same for the condition of the conditional BranchInst. This is so that we have a good “fallback” location if / when those primary candidates get optimized away.
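As an illustrative sketch (the exact textual IR syntax here is an assumption, not the prototype’s final form; the field names follow the description above), the IR for `c = a + b;` might carry metadata like this, with the store as the rank-1 candidate and the add as its fallback:

```llvm
; Hypothetical printed form for "c = a + b;": both instructions share
; atomGroup 1; the store is the preferred (rank 1) is_stmt candidate and
; the add is its lower-precedence (rank 2) fallback.
%add = add nsw i32 %a, %b, !dbg !11
store i32 %add, ptr %c.addr, !dbg !10

!10 = !DILocation(line: 3, scope: !6, atomGroup: 1, atomRank: 1)
!11 = !DILocation(line: 3, scope: !6, atomGroup: 1, atomRank: 2)
```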

During optimisation - Throughout optimisation, the DILocation is propagated normally: cloned instructions get the original’s DILocation, the new fields get merged in getMergedLocation, etc. However, pass writers need to intercede in cases where a code path is duplicated, e.g. unrolling or jump-threading. In these cases we want to emit key instructions in both the original and duplicated code, so the duplicates must be assigned new atomGroup numbers, much as instruction operands must get remapped. There are facilities to help with this: mapAtomInstance(const DebugLoc &DL, ValueToValueMapTy &VMap) adds an entry to VMap which can later be used for remapping with llvm::RemapSourceAtom(Instruction *I, ValueToValueMapTy &VM). mapAtomInstance is called from llvm::CloneBasicBlock and llvm::RemapSourceAtom is called from llvm::RemapInstruction, so in many cases no additional effort is actually needed.

mapAtomInstance ensures LLVMContextImpl::NextAtomGroup is kept up to date, which is the global “next available atom number”.
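A standalone toy model of this renumbering (hypothetical names and shapes; this is not the prototype’s actual code) might look like:

```cpp
// Toy model of atom renumbering when a code path is cloned (e.g. by loop
// unrolling). Cloned instructions keep their grouping relative to each
// other but receive fresh atomGroup numbers, mirroring the roles of
// mapAtomInstance (allocate a new group, record it in the map) and
// RemapSourceAtom (apply the map to each cloned instruction).
#include <cassert>
#include <map>
#include <vector>

static unsigned NextAtomGroup = 3; // stand-in for LLVMContextImpl::NextAtomGroup

// Groups[i] is the atomGroup of the i-th cloned instruction (0 = no atom).
std::vector<unsigned> cloneAtoms(const std::vector<unsigned> &Groups) {
  std::map<unsigned, unsigned> Remap; // old group -> new group (the VMap role)
  std::vector<unsigned> Cloned;
  for (unsigned G : Groups) {
    if (G == 0) {             // not part of any atom; nothing to remap
      Cloned.push_back(0);
      continue;
    }
    auto [It, Inserted] = Remap.try_emplace(G, 0);
    if (Inserted)
      It->second = NextAtomGroup++; // "mapAtomInstance": allocate a fresh group
    Cloned.push_back(It->second);   // "RemapSourceAtom": apply the mapping
  }
  return Cloned;
}
```

With NextAtomGroup starting at 3, cloning instructions with groups {1, 1, 2, 0} yields {3, 3, 4, 0}: the two group-1 instructions stay grouped together under their new number.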

The DILocations carry over from IR to MIR without any changes.

DWARF emission - Iterate over all instructions in a function. For each (atomGroup, InlinedAt) pair we find the set of instructions sharing the lowest rank number. Typically only the last of these instructions in each basic block is included in the set (not true for branches - all branches of the same rank are included). The instructions in this set get is_stmt applied to their source locations.

Evaluation

I’ve modified dexter to plot debugging traces (those changes are out of scope of this RFC). Dexter drives a debugger, single stepping through a requested function while recording the source location of each step. The line number of each step relative to the function start is plotted on a graph over “debugging steps” (i.e., discrete time). We do this for several different build configurations of the debug target, which enables us to compare the stepping behaviour at a glance.

I’ve drawn the traces by hand, for O2 and O2 with key instructions, on these two images:

(Images: hand-drawn traces labelled “O2” and “O2 with key instructions”.)

Plotted on a graph, including an O0 trace (green) we get:

You can see there’s a lot of bouncing around at O2 (orange), in this case due to scheduling. The Key Instructions (purple) debugging behaviour matches the O0 (green) behaviour for this simple example.

I haven’t found a good quantitative measurement to evaluate this project, so my approach has been to generate and look at a number of these graphs and to manually step through “real” code, comparing debugging optimised code with/without the feature and at O0.

I’d like to share more of these graphs - I’ll aim to do so after the conference. Here’s a preview for now, on a trace from Lua, in which you can see another win for Key Instructions.

And another from Opensteer (inlined at O2 - and consequently at O2 with Key Instructions too, since the feature doesn’t affect optimisations):

These graphs demonstrate that Key Instructions reduces the complexity of stepping in “real” code. By removing unnecessary noisy steps, it makes the optimized stepping experience feel closer to an unoptimized experience.

Open question(s)

Q1 Should we use a pre-optimisation pass or have front ends apply the new metadata?

We currently use a heuristic-based approach to apply the new metadata in a pre-optimisation pass. An alternative would be for front ends to do it themselves when creating the instructions (i.e., with new IRBuilder functionality).

Pros of current approach:

  • Language / front end independent
  • Simple & understandable; a small loop in one place

Cons of current approach:

  • Some instructions’ DILocations are regenerated (setting the new fields), which is wasteful / slower than if the front end created them correctly (a significant portion of the O0 compile time increase)
  • Front ends have limited or no control over what is considered “interesting”

I am not a front end engineer; without diving into it, I’m not sure how difficult it would be to implement that way.

In [DWARF] Specifying that a DILocation should be is_stmt=false @khuey points out that some front ends may (a) perform their own mandatory optimisations even at O0, which is relevant to the question above, and (b) may produce instructions which are undesirable for stepping (in the example in that thread, some constant variable initialisation isn’t desirable for users to step on at O0).

For (b) - this is easy to solve if front ends are responsible for applying the new metadata (simply don’t apply it to “uninteresting” instructions).

Otherwise, we’ll want a mechanism for the front end to convey that an instruction isn’t interesting. I think the approach proposed in that thread could work orthogonally alongside Key Instructions just fine. The idea is essentially to add a “never-is-stmt” bit to DILocations so the front end can exclude some instructions from stepping. At DWARF emission stage, the “never-is-stmt” bit would have precedence over Key Instructions. It’s just a matter of working out the best way to pack the new DILocation fields (and possibly thinking ahead for future fields).

There’s also the hybrid support option: we could provide the pre-optimisation pass and the tools for front ends to do it themselves. We could even start with one and migrate to the other.

That’s the only major question for me right now, but I’ll add any others brought up in the thread here.

Conclusion

Our prototyped changes to the compiler improve interactive debugging of optimized code by reducing the impact of code motion and scheduling, without affecting codegen or instruction source attribution.

We’d like to implement this in LLVM so we’re seeking general approval and would welcome any concerns and general thoughts.

References

  1. Key Instructions: Solving the Code Location Problem for Optimized Code (C. Tice, S. L. Graham, 2000)
  2. Debugging Optimized Code: Concepts and Implementation on DIGITAL Alpha Systems (R. F. Brender et al.)

Thanks for sharing this, I am very excited about the potential here!

Side note: This might be an opportunity to optimize the in-memory encoding since I’d expect all source locations within an atom to share most of their source location details, so we could encode all DILocations belonging to the same atom group as tiny line/col diffs to a shared full DILocation.

If I understand correctly, your prototype starts out with traditional IR and then a key-instructions pass applies a heuristic to assign atoms. That sounds like a good fit for clang. Do you have any insight how this would work for frontends that already do their own optimizations before lowering to LLVM IR, such as flang (with its mlir passes) or swift (with its SIL optimizer)? Would you suggest they do the same thing at a higher level and produce already-annotated IR?


True, most locations in a group tend to share details, and I agree there’s definitely room to explore the in-memory representation. Another approach I wanted to try out is subclassing DILocation so that only the instances that need them carry the extra fields.

I am glad you asked, as I think getting this bit right is important. I cover this a bit in the “Open Question(s)” section, including some basic pros and cons of the current approach.

such as flang (with its mlir passes) or swift (with its SIL optimizer)

And from the thread linked in the RFC section I mentioned, it looks like Rust also does pre-LLVM optimisations (even at O0).

Would you suggest they do the same thing at a higher level and produce already-annotated IR?

I think so. I think even for Clang it might make sense to have the front end annotate the IR while codegenning (rather than after the fact). That avoids the cost of scanning over all instructions and updating (replacing) DILocations.

I’ll have a look at what it’ll take to implement that in Clang (I don’t have a good intuition as to the challenges just yet).

I think that would be an important data point — thanks!

Thanks for sharing this RFC!

Interestingly, I also ran into issues with is_stmt just yesterday; however, in my case the issue was that clang is not using is_stmt frequently enough, while your issue is that clang is overusing it. I am now wondering what the intended semantics of is_stmt in DWARF v5 are.

My use case: “column breakpoints”, i.e. I want to enable the user to set breakpoints on individual columns within a larger expression. E.g., in the statement below, I highlighted all the places where I would like to offer breakpoints to the user:

int x = foo() + bar() + baz();
//      ^       ^       ^
// intended breakpoints: before each function call
auto my_lambda = [](int a) { return a * 2; };
//   ^                       ^
// intended breakpoints: creating the lambda instance itself, and inside the lambda

The Dwarf standard defines is_stmt as (emphasis mine):

A boolean indicating that the current instruction is a recommended breakpoint location. A recommended breakpoint location is intended to “represent” a line, a statement and/or a semantically distinct subpart of a statement.

As such, I was hoping to use the is_stmt to find all relevant positions for column level breakpoints. However, I then realized that clang does not actually mark the individual function calls as “distinct subparts” using the is_stmt flag.

I won’t have time to teach clang to mark those function calls as is_stmt.
But I wonder how this would relate to your RFC. Afaict,

  • your RFC would mean that multiple statements on a single line would be marked as separate is_stmt instructions. E.g., my lambda example above would be correctly annotated with two is_stmt instructions: one outside the lambda, one inside the lambda body.
  • your RFC would make it easier for me to teach clang to mark the individual function calls in the first example with is_stmt. I could simply assign a separate atomGroup to each of the individual function calls.
  • at the same time, if I did so, I would counteract your proposal. By introducing more is_stmt instructions, I would reintroduce the very problem around “too many is_stmt instructions” which you are trying to solve.

I wonder if I am misinterpreting the DWARF standard here? Is there some smart way to reach both our goals at the same time? Or are our two goals just fundamentally incompatible? :thinking:

(PS: you can find my current implementation without using is_stmt in
[lldb-dap] Support column breakpoints by vogelsgesang · Pull Request #113787 · llvm/llvm-project · GitHub)


Thanks for commenting, this is an interesting point!

In both your first example and in the blog post linked in your PR, the expressions only contain function calls. Do you want to enable column-level breakpoints at calls only, or also non-call operations in expressions?

I ask because with our approach, calls specifically always get is_stmt applied (we think calls are “interesting source atoms”, so we don’t want to skip over them while stepping). So there may be no conflict of approaches here?

Yeah, that’s right (I haven’t tested it, but that would be the intended behaviour).

(See above about calls)

Are you interested in unoptimized- or optimized-code debugging in this case (or both)?


@labath raised a similar point at the US LLVM conference for O0 builds: it’s desirable to be able to stop before any of the expression has been executed, so variables can be adjusted. E.g. for int c = a + b; a user might want to modify b before the + is executed.

I think that being able to specifically stop at the + would be difficult (by the time you’ve stopped at the + some variables might have been loaded into registers which may not be reflected in the O0 variable location), but stopping before any of the expression has evaluated is the current behaviour.

So I’m reconsidering my initial position that we want to enable this at O0 by default (I still think it’s a big win for optimized builds).

After lambda bodies, function calls are probably the most important use case. Besides explicit function calls, also “implicit” calls like overloaded + operators would be useful to have column breakpoints on. Would you also consider those as “interesting source atoms”?

Ah, great. I missed that detail in your initial post. Then there is probably indeed no goal conflict here, and both approaches would combine nicely :slightly_smiling_face:

Preferably both - I will take what I can get :smiley:

Interestingly enough, with the current implementation in [lldb-dap] Support column breakpoints by vogelsgesang · Pull Request #113787 · llvm/llvm-project · GitHub, this works. I am simply offering all columns where there is any debugging info available as a potential breakpoint, independent of whether it is marked as is_stmt or not.

If the is_stmt implementation gets improved to actually indicate the interesting source atoms, I would probably introduce a setting which allows the user to choose between “display interesting column breakpoints only” and “offer all potential column breakpoints”.

Why? I couldn’t find a comment above which would speak against enabling this also in O0 builds. Or did I miss something?

I think those improved is_stmt annotations would also make sense in O0


The issue in this case is that with this feature, the instruction that gets used for breakpoints is the instruction that completes the semantic effect of the source statement, rather than the first instruction associated with that statement. This means that for a simple source statement like int c = a + b;, currently at O0 the breakpoint is set at the first instruction, which would probably be loading the value of a or b; this means that at that breakpoint you could modify the value of a or b in the debugger and the modified value would be used to compute c. With this feature, the breakpoint would instead be set on the store to c, which would mean that modifying a or b in the debugger would have no effect on the value of c.
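To make that concrete, here is a rough O0-style lowering of int c = a + b; (illustrative IR, not actual clang output):

```llvm
; Rough, unoptimised lowering of "int c = a + b;". Today's O0 line table
; stops at the first instruction of the line (the load of a), so a and b
; can still be modified in the debugger before the add runs. With Key
; Instructions, is_stmt would move to the store that completes the
; assignment.
%0 = load i32, ptr %a.addr      ; current O0 breakpoint lands here
%1 = load i32, ptr %b.addr
%add = add nsw i32 %0, %1
store i32 %add, ptr %c.addr     ; Key Instruction for the assignment
```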

I think this is an unfortunate result, but I personally would prefer to have it than to use different is_stmt logic in unoptimized builds compared to optimized builds. The effect is somewhat visible to the user if column info is enabled, since the breakpoint will be on the = rather than within the RHS expression. It’s also possible that we could find a reasonable workaround for this, either with a compiler flag to change behaviour or a debugger feature (e.g. allowing users to set line breakpoints that ignore is_stmt); my perspective is mostly centered on optimized debugging however, so it would be good to hear more opinions from anyone for whom this is a more direct concern!

What we have here (whether to use KI at O0) is a UX question rather than a correctness question. KI is intended to be a solution to a problem that manifests with optimization, and make the debugging experience smoother.

Using KI at O0 is arguably a solution in search of a problem. If the customary experience of O0 debugging changes, and not in a good way (as in the c = a + b example), I’m hard pressed to call that a positive outcome.

Using KI at O0 is arguably a solution in search of a problem.

My understanding is that enabling this feature also at O0 would improve the UX for column-level breakpoints. That’s at least how I understood

The problem of whether to mark the first or the last instruction of an “interesting source atom” via is_stmt could be treated independently. E.g., under -O0 we could mark the first instruction of an interesting atom as is_stmt and under -O3 the last instruction.

Or am I missing something?

The O0 behaviour question is an interesting one, and I welcome all this discussion. I’ve got a meeting with our debugger team next week to discuss these sorts of things, so I’ll hold off on offering a strong opinion until then. Still happy to explore ideas, though.

My understanding is that enabling this feature also at O0 would improve the UX for column-level breakpoints. That’s at least how I understood

From your examples, I believe so too. Specifically, single lines with multiple statements and lines with multiple calls.

E.g., under -O0 we could mark the first instruction of an interesting atom as is_stmt and under -O3 the last instruction.

Our debugger team had an interesting thought here - is this possibly less a question of O0 (/optnone) and more a dynamic property of optimisation?

If in some cases we want to mark the first instruction and sometimes the last, is there a heuristic we can use to determine that, without inspecting the optimisation flags? For example: if the atom’s instructions are contiguous then put is_stmt at the start, else at the end. I think that sounds plausible, but I haven’t put a great deal of thought into it yet.
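As a sketch of that idea (an unimplemented heuristic, with positions modelled as instruction indices within a block):

```cpp
// Sketch of the contiguity heuristic: if an atom's instructions form one
// contiguous run, placing is_stmt on the first recovers the O0-like
// "stop before anything executes" behaviour; if the atom is interleaved
// with other code, fall back to the last (the Key Instruction).
#include <cassert>
#include <cstddef>
#include <vector>

// AtomInsts: indices of one atom's instructions, in program order (non-empty).
std::size_t pickIsStmtIndex(const std::vector<std::size_t> &AtomInsts) {
  for (std::size_t I = 1; I < AtomInsts.size(); ++I)
    if (AtomInsts[I] != AtomInsts[I - 1] + 1)
      return AtomInsts.back(); // interleaved: mark the key instruction
  return AtomInsts.front();    // contiguous: mark the first instruction
}
```

So an atom occupying instructions {4, 5, 6} would be marked at 4, while one scattered across {2, 5, 9} would be marked at 9.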

I was actually thinking about something similar (I also work on a debugger team :P).

It sounds to me like we don’t necessarily care about stopping directly at the key instruction. Stopping a few instructions before it (assuming they’re all on the same line and everything) should be fine as well.

So, maybe, if we take each contiguous sequence of instructions belonging to one atom (is that the right word?) and just always place the is_stmt flag on the first instruction in that sequence, then we might get the -O0 behavior as a side-effect? (*)

At the same time, I agree with what Paul said about searching for a problem, and I think we should enable this at -O0 only if we’re sure it won’t impact the experience negatively. Nonetheless I think there is some room for improvement at -O0, and this sounds like it might deliver it:

Currently, if we have a function call formatted like so:

foo(a,
    b,
    c,
    d);

then we will stop at every line of that function call (presumably because it contains a load instruction), even though those “statements” aren’t really interesting. If something like this could make us stop at this function call only once (which I think is what will happen at -O3 with this RFC), then I think that would be a positive outcome.

(*) Although, if that is all we do, then according to Shannon’s information law :P, this modification does not carry any information, and we might be able to let the debugger do it.


It probably would. But not everyone will want column-level breakpoints? This is part of what I’m getting at with it being a UX question rather than correctness question.

Moving is_stmt up to the first instruction in the atom containing the KI, rather than on the KI itself, might help address this point. The scope of the atom could vary depending on whether we had column-level breakpoints enabled, so this movement would or would not cross a column change depending.


Rather than just thinking about it as a way to get O0 behaviour as a side effect, I think we should also consider other optimisation levels such as O1 and (ideal) Og. Using the heuristic would let us choose the best option at a source-atom-granularity, rather than optimisation-level-granularity, which could further positively impact all optimisation levels.

That’s right, with Key Instructions we’d just stop at this source atom once (ideally at the first load at O0 as discussed).

And this was already mentioned, but just to reiterate, if we instead had:

foo(a,
    bar(),
    c,
    d);

we would consider bar() a separate source atom so it can still be stepped on.