[RFC] Instruction API changes needed to eliminate debug intrinsics from IR

Hi,

This is a proposal for some changes to LLVM’s instruction movement/insertion APIs to better describe the movement of debug-info. In order to remove debug intrinsics from IR we need to first improve the expressiveness of the instruction-moving APIs. This RFC proposes a way to achieve that, with a couple of remaining question marks.

A while back in Prototyping a not-an-instruction dbg.value I suggested a pathway towards storing variable location debug-info in something that wasn’t an intrinsic. The benefits are faster -g compiles, and more robust compilation, as the presence of debug intrinsics can interfere with optimisations. We’ve got a prototype at [0] that can build a clang-3.4 binary with no dbg.value intrinsics and produces identical object files [1] – it’s not a complete solution, but we reckon it exposes all of the important design problems.

Today I’d like to focus on a single issue: the fact that debug intrinsics have instruction iterators (BasicBlock::iterator) which are functionally important to debug-info, and which won’t exist if we cease using intrinsics. We currently consider any instruction to be “a position in the function”, where a debug-info intrinsic is a position in between two “real” instructions. Without debug intrinsics, we won’t be able to describe “in between” positions any more; instead we would have to attach the debug-info to instructions themselves, which raises more issues. To illustrate, consider this block:

  block:
    dbg.value(...)
    %1 = add...
    dbg.value(...)
    %2 = sub...
    br label %exit

In today’s LLVM, each instruction has an iterator, therefore we can describe any range of “real” instructions and debug-info instructions to be spliced from one block to another. However, if the debug-info was attached to instructions, we can only have iterators that point at the add, sub, and br. That’s strictly less expressive than what we have today, and this causes meaningful differences in debug-info.

I’ve got two problems to illustrate with examples: the movement of instructions with debug-info attached, and inserting new instructions at the start of blocks.

Instruction::moveBefore can be used to move any instruction from one place to another, without any consideration for debug-info. This isn’t a problem today: if code moves a single instruction then it usually doesn’t want debug-info to move too, while code like moveBBContents [2] just moves every instruction from one place to another including debug instructions. However if we attach debug-info to instructions, moveBefore would become responsible for moving the debug-info, and whether it should move or not can be ambiguous. Consider today’s IR:

  bb1:
      dbg.value(...)
      %foo = add %1, %2
      %bar = sub %3, %4

Assume that we can attach the dbg.value information to the add instruction in some way. Now imagine that we called moveBefore, moving the add instruction to another block. If, for example, we are hoisting the add during CSE then the debug-info should not move with the instruction; it should stay in the source block. However if we’re in a loop like moveBBContents [2] moving all instructions in a block, then it’s necessary for moveBefore to move the debug-info to the destination block. Currently, there’s no way in LLVM to distinguish the two use cases.

The other scenario is inserting one instruction in front of an existing instruction. Should debug-info attached to the existing instruction come before or after the new instruction? We can describe both when debug-info has iterators, but will be forced to pick one if debug-info is attached to instructions. Unfortunately there is no one-size-fits-all solution: in the majority of situations, LLVM today inserts instructions after nearby debug-info intrinsics. However there are other situations where the opposite is needed: for example in MergeBasicBlockIntoOnlyPred [3], where one block is spliced into another, the predecessor block’s instructions should always come before the successor’s debug-info: the blocks are effectively concatenated. Today, this is unambiguous because BasicBlock::begin and getFirstInsertionPt will return iterators to debug-info intrinsics. However, if we don’t use intrinsics for debug-info, there is no way to describe inserting before the debug-info.
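For illustration, here’s a minimal sketch of how today’s intrinsic-based API expresses both placements at the top of a block (NewInst is a hypothetical instruction being inserted into block BB; empty-block edge cases are ignored):

    // begin() / getFirstInsertionPt() skip PHIs and EH pads but will happily
    // point at a leading dbg.value, so the new instruction lands *before* it.
    NewInst->insertBefore(&*BB.getFirstInsertionPt());

    // This accessor skips the debug intrinsics too, so the new instruction
    // lands *after* them.
    NewInst->insertBefore(BB.getFirstNonPHIOrDbg());

Without intrinsics, nothing distinguishes those two positions unless the iterator (or something else) carries that information.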

~

There’s a fairly simple solution for the first problem, which is to introduce a specialised moveBefore method that moves debug-info too, and update all the call sites in LLVM that need that behaviour. This is achievable as there are only about twenty call sites that behave in that way. In the prototype linked below the relevant method is called moveBeforePreserving, and we renamed moveBefore to moveBeforeBreaking to make the difference explicit. I think it’s useful to have the developer express the “disposition” of the move they’re making: whether it preserves the flow of instructions and thus should transfer debug-info, or whether it breaks the flow.
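As a rough sketch, using the prototype’s method names (neither is in-tree today) and assuming a hypothetical insertion point InsertPt, the two dispositions look like:

    // Hoist for CSE: the value moves, but the source-level ordering of
    // assignments should not, so attached debug-info stays in the old block.
    Inst->moveBeforeBreaking(InsertPt);

    // moveBBContents-style bulk move: the whole instruction stream is being
    // relocated, so each instruction's attached debug-info travels with it.
    for (Instruction &I : llvm::make_early_inc_range(*SrcBB))
      I.moveBeforePreserving(InsertPt);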

However, it’s much harder for the insert-at-block-begin example: it’s a much more common code pattern, with between 100 and 200 call sites where this distinction has an actual effect on variable locations. We’ve explored and implemented a solution: store whether an iterator came from begin / getFirstInsertionPt in a bit inside the BasicBlock::iterator object, aka ilist_iterator [4]. This feels bad because it’s taking a value type that’s a single pointer right now and adding more data; however I think it’s inescapable. Iterators are exactly the right design pattern for storing information about positions in a collection. We could:

  • Encode whether an iterator came from begin or getFirstInsertionPt in the returned type: however that feels ugly and will produce ugly compile errors,
  • Add a flag parameter to instruction insertion methods, which is unergonomic and error prone,
  • Have the fact that an iterator was intended to be at the start of the block (before debug-info) signalled at runtime, which requires information in the iterator object.

It’s the latter that we’ve used in our prototype – a variety of instruction insertion methods have had overloads added that take iterators, and call sites that use begin or getFirstInsertionPt (or getFirstNonPHI) now pass an iterator for the position to insert instructions. This is straightforward; for additional safety the instruction-accepting insertion APIs could be removed, forcing everyone to use iterators when inserting. That’s invasive, but IMHO a reasonable price to pay for clearer management of debug-info. We’re using two bits in the prototype: because a pair of instruction iterators form a half-open range of instructions, we use the extra bits to signal whether the iterators are inclusive of the debug-info attached at the start and end. Coming back to our earlier example:

  bb2:
    dbg.value(...)
    %1 = add...
    dbg.value(...)
    %2 = sub...
    br label %exit

Imagine we have an iterator range from %1 to %2. If we were to splice that range into another block, today LLVM would transfer the add instruction and the following dbg.value. If we attached debug-info to instructions and didn’t have debug intrinsics, that would be the only transfer we could describe. Thus, the two iterator bits signal:

  • For the “first” iterator, whether debug-info attached to the first instruction should be transferred too,
  • For the “last” iterator, whether debug-info attached to the last position should be transferred too,

respectively. That allows us to describe almost all the transfers that could be performed when debug-info is an intrinsic: both dbg.values and the add instruction, or just the add instruction by itself. I think it’s noteworthy that with accessors like begin and getFirstInsertionPt returning iterators with the bits set already, we only found two sites in LLVM where those bits need to be manually updated.
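To make that concrete, here’s a rough C++ sketch with hypothetical accessor names (setHeadBit / setTailBit; the prototype’s exact spelling may differ), splicing the [%1, %2) range out of bb2 above into another block with a BasicBlock::splice-style call:

    BasicBlock::iterator First = Add->getIterator();   // %1 = add...
    BasicBlock::iterator Last  = Sub->getIterator();    // %2 = sub...

    // Both bits clear: just the add moves, both dbg.values stay in bb2.
    Dest->splice(Dest->end(), SrcBB, First, Last);

    // Setting the bits widens the range to take attached debug-info too;
    // different combinations describe any of the transfers that are possible
    // when the debug-info is an intrinsic.
    First.setHeadBit(true);  // hypothetical: also take debug-info attached to %1
    Last.setTailBit(true);   // hypothetical: also take debug-info attached at %2
    Dest->splice(Dest->end(), SrcBB, First, Last);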

In terms of performance: adding the bit to BasicBlock::iterator has a small compile-time cost for no-debug-info builds (0.05%) according to [5]. However, this is going to be paid for by eliminating calls to functions like getNextNonDebugInstruction, which recovers almost 0.3% of compile time [6]. This is definitely not the final word on compile-time performance, but I want to indicate the costs aren’t going to be overwhelming. The change in memory usage is negligible too, as we typically don’t store iterators in data structures.

As proof that this technique can work, I’d like to present our prototype [0] (based on llvm-15) as evidence – please ignore the implementation details of what we’re doing to blocks, instructions and storage of debug-info, those are all very changeable. My major concern is how the instruction API changes fan out into the rest of the compiler, and whether that’s a tolerable burden for other compiler developers. Thus, I think it’s noteworthy that we’re “only” touching 200 files, and the vast majority of changes are changing the spelling of moveBefore or passing an iterator into a splice/insertion method.

There’s an additional problem with eliminating debug intrinsics: blocks can be temporarily empty (no terminator) but still contain debug instructions, which should then go in front of any terminator added later. Our solution to that so far is to special-case the insertion of a terminator into a block, which is sufficient to fix up any out-of-place debug-info.
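A very rough sketch of that special case, with entirely hypothetical helper names (the prototype’s mechanism differs in detail):

    // When a terminator is inserted into a block that currently has no
    // instructions, any debug-info left "dangling" on the block is re-homed
    // so that it ends up in front of the new terminator.
    void insertTerminator(BasicBlock &BB, Instruction *Term) {
      Term->insertInto(&BB, BB.end());
      if (BB.hasDanglingDebugInfo())          // hypothetical query
        BB.reattachDanglingDebugInfo(Term);   // hypothetical: records go before Term
    }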

Feedback most welcome: specifically for the two changes that would be needed for the instruction API:

  • The need to use a moveBefore method that explicitly states the intention of the developer wrt. whether debug-info should be moved too.
  • Sticking some extra bits in BasicBlock::iterator to signal at runtime whether a position in a block comes before or after the nearby debug-info.

To provide context, the other things we’re working on are the storage model for variable-location information, and what kind of maintenance would need to be applied during optimisations. We haven’t thought about representing these things in bitcode / textual IR and I’d prefer to avoid thinking about it for now.

Cheers to @StephenTozer for ploughing through a lot of this. CC the usual debug-info folks: @adrian.prantl @dblaikie @rnk @pogo59 @cmtice @OCHyams @jryans @slinder1 , although I think this is probably relevant to the interests of all pass-authors.

[0] Publish our prototype "killing debug-intrinsics" diff by jmorse · Pull Request #1 · jmorse/llvm-project · GitHub
[1] CommandLine.cpp in clang-3.4 bakes the build time into itself; all the other files are identical.
[2] llvm-project/IROutliner.cpp at c42eda5d3692fcd67fc3d043ab37f1950ea653b9 · llvm/llvm-project · GitHub
[3] llvm-project/Local.cpp at a33f018b89c07e0728539b34c158e88a7db49982 · llvm/llvm-project · GitHub
[4] Cue gasps of horror from the audience!
[5] LLVM Compile-Time Tracker
[6] LLVM Compile-Time Tracker NOTE: the -g modes are not meaningful as all debug-info is dropped.


This sounds great, thank you for the write-up!

To recap (so you can tell me if I’m missing something):

  • The current API essentially has this distinction between “preserving”/“breaking” already, it just is not explicitly recorded anywhere because the debug markers are actually Instructions, and so the behavior we want essentially falls out: you generally move all instructions when you are “preserving” and a single instruction when you are “breaking”.
    • An example where debug-info should move with instructions is e.g. inlining. When we move many instructions at once, we naturally choose to bring along the interspersed debug-info instructions. This corresponds to the “preserving” disposition.
    • An example where debug-info should not move with instructions is e.g. CSE. The debug-info describing when e.g. an assignment occurs doesn’t want to be moved relative to the rest of the instructions just because CSE hoisted one of them. This corresponds to the “breaking” disposition.

Assuming I’m following so far, the choice to move the high-level disposition into the iterator seems reasonable to me. If there is not a significant runtime performance hit then I agree doing this in the type system in C++ doesn’t sound appealing.

I think I do get a little lost when trying to understand the piece about begin/getFirstInsertionPt. In the example of:

bb2:
    dbg.value(A, ...)
    %1 = add...
    dbg.value(B, ...)
    %2 = sub...
    br label %exit

Does this translate to something like:

bb2:
    %1 = add... !dbg.attachments {dbg.value(A, ...)}
    %2 = sub... !dbg.attachments {dbg.value(B, ...)}
    br label %exit

?

And so the debug-info “of” a particular instruction takes effect before it?

If so, then I don’t understand the example of MergeBasicBlockIntoOnlyPred. It seems like it would naturally fall out that DestBB->splice(DestBB->begin(), PredBB) would always result in the whole PredBB including its debug info coming before the first instruction and first debug info of DestBB. Maybe that’s your point, as it is using begin, but then I don’t understand in what situation another behavior is possible.

I admittedly have not spent the time to look at the code yet, so sorry if this is an obvious question!

The current API essentially has this distinction between “preserving”/“breaking” already, it just is not explicitly recorded anywhere because the debug markers are actually Instructions, and so the behavior we want essentially falls out:

Yup, that’s exactly it, and the two examples are correct too,

Does this translate to something like […] and so the debug-info “of” a particular instruction takes effect before it?

That’s correct – I’ve been trying to avoid any textual representation because that’s a discussion I want to avoid, but what you wrote is a fair illustration of the information that would be stored.

If so, then I don’t understand the example of MergeBasicBlockIntoOnlyPred. It seems like it would naturally fall out that DestBB->splice(DestBB->begin(), PredBB) would always result in the whole PredBB including its debug info coming before the first instruction and first debug info of DestBB. Maybe that’s your point, as it is using begin, but then I don’t understand in what situation another behavior is possible.

Taking a quick survey of the calls to “blockSplice” in the prototype code, half of the call sites insert at either begin() or end(), while the other half insert at a specific instruction. There are a few examples in InlineFunction.cpp where groups of allocas are shuffled around, or single blocks are directly inlined into the call-site. In those scenarios, the splice function needs to insert instructions after any attached debug-info. For example, if we had:

bb1:
  %2 = add i32 %1, %0
  call void @foo(%2), !dbg.attachments {dbg.value(A,...)}
  br label %bar

and then inlined the contents of foo into the block at the position of the call instruction, the attachments should be moved onto the first instruction inlined so that they’re visible before the call site. If that didn’t happen, they would sink down to the branch instruction when the call was deleted, and this wouldn’t accurately represent the source program. Hence, there’s a need for the splice function to know whether the caller wants the moving instructions to come before or after the debug-info at the insertion point.

That being said: there are only about 15 sites where the blockSplice method is actually needed, so it wouldn’t be a huge burden to expect callers to express this information as an argument, instead of as information baked into the iterator. The larger problem is the several hundred call sites where instructions are inserted at the start of a block with something like moveBefore, and the pass author has made a decision about whether the instruction should come before or after debug intrinsics by calling begin/getFirstInsertionPt for inserting before, or getFirstNonPHIOrDbg for inserting after. When we have debug intrinsics with iterators, that distinction is communicated to the inserting function by the position being a debug-instruction (or not); in the prototype it’s communicated by the bits in the iterator object.
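For the splice case, the explicit-argument alternative mentioned above could be as simple as a hypothetical extra parameter (sketch only, not the prototype’s actual signature):

    enum class DbgInfoPlacement { BeforeAttached, AfterAttached };

    // Callers state the disposition explicitly rather than it being carried
    // by the iterator, e.g. a MergeBasicBlockIntoOnlyPred-style concatenation:
    //   blockSplice(*DestBB, DestBB->begin(), *PredBB,
    //               DbgInfoPlacement::BeforeAttached);
    void blockSplice(BasicBlock &Dest, BasicBlock::iterator Where,
                     BasicBlock &Src, DbgInfoPlacement Placement);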

These decisions (insert at the start of the block, before or after debug-info?) have to be preserved to get an identical binary; however, it’s not clear whether they’re necessary for preserving the quality of debug-info. Perhaps it doesn’t matter whether a new instruction generated by SCEV goes before or after variable assignments; on the other hand, if we sink an instruction from a parent to a successor block then we might care what order it has with the variable assignments in the successor. Baking this information into the iterator avoids having to answer those questions, which may seem like an evasion, but on the other hand it guarantees no regressions.

Ah, that makes sense, thank you! I think I just needed to spend a bit more time to get to the cases you were referring to, but took the lazy way out and just asked. Thank you for humoring me :)

I definitely see the benefit of retaining the information implied at the creation of the iterator. It seems the most elegant and least error-prone option.

I really like the approach of starting from true “NFC”, and I’m a bit shocked you were able to achieve it! I also particularly enjoy the “inhale”/“exhale” approach around serialization to defer updating tests. I will have to spend a bit more time reading the code, but I’m fairly convinced of the approach.

Thanks Scott,

Here’s an outline of how we’d like to move this forwards; we can put a lot of non-invasive changes in-tree with zero impact on the rest of LLVM, which makes development easier for us. We’re continuing to iterate on the data-structures used for non-instruction debug-info; it’s at the point where -g compiles are a little faster (while still doing a lot of redundant work), and there’ll be more news on that in the future.

  1. We can fix a variety of debug-info-affects-codegen faults we’ve found in this process,
  2. Add iterator-accepting insertion handlers and utilities; for the most part this is only changing whether an iterator is dereferenced or getIterator is called on an instruction,
  3. Channel all block splicing through a single function (this has largely been done on main by others already),
  4. Use a specialised moveBefore method with the “moving debug-info” disposition at about 20 locations in LLVM.

That’ll reduce the size of the delta we have from ‘main’ to roughly 30 files, which is much easier to manage. Once we’ve got a Good™ in-memory representation, we’ll make another proposal about that, and then:

  1. Add the relevant files and plumbing to the llvm/lib/IR directory,
  2. Add parallel implementations of the 15 to 20 places that directly update dbg.values today,
  3. Add the relevant bits to BasicBlock::iterator as mentioned above.

This will allow for more equivalence testing and experimentation by other developers, plus evaluation of the compile-time benefits. The final steps would be working out how to represent this information in textual IR and bitcode, plus updating the 1000 tests containing dbg.values.


Hello folks – FYI I gave a presentation at EuroLLVM illustrating the ideas behind the proposal above and exploring where the costs are; I’ll link to that from here when the videos are up.

I’ve started uploading our patches to Phabricator, forming a stack based on [0]. If you look at the five patches based on that, you’ll see the extent of the invasiveness of these changes: basically turning a bunch of things that were instruction pointers into iterators. I don’t think this is especially remarkable as it’s effectively a spelling change, although you might say that the iterator form is slightly more verbose.
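To give a feel for the kind of change involved (a sketch of the pattern, not a line lifted from the patches):

    // Before: insertion positions passed around as Instruction pointers.
    NewInst->insertBefore(InsertPt);
    Builder.SetInsertPoint(InsertPt);

    // After: the same positions expressed as BasicBlock::iterators, which can
    // later carry the extra debug-info bit(s).
    NewInst->insertBefore(InsertPt->getIterator());
    Builder.SetInsertPoint(InsertPt->getParent(), InsertPt->getIterator());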

Technically these changes are NFC (aside from D151419 itself) – however they won’t be NFC in the context of a debug-info-intrinsic-free LLVM, because they will be transmitting more information about the debug-info context through the BasicBlock::iterator object once that happens. This raises an annoying question: how do we go about testing these changes? So far I’ve been comparing the output of a compile with no intrinsics against one that does use intrinsics, to determine differences from the “correct” output: however this isn’t a long-term strategy as we want to get rid of intrinsics entirely. It’s also infeasible to write a test for each changed code path because there are going to be hundreds of them.

If we’re willing to accept that with these instruction API changes the only way to insert instructions is going to be with an iterator, then the changes I’ve linked to don’t need to be tested: they’re just the first tranche of changes needed in replacing one set of APIs with another. That’s my preferred direction of travel / interpretation – no-one has yet objected to the idea of requiring insertion with a BasicBlock::iterator, or to the need to store an extra bit in the iterator class. If you have an objection to this idea, please do speak up now!

Testing for the correct placement of variable location information is an open problem as we don’t have comprehensive testing of debug-info in the context of every optimisation transformation.

Going forwards: the next set of patches to put in that stack will be for a naive implementation of storing variable assignment information outside of instructions. I’d prefer to avoid premature optimisation at this stage – we’re still optimising it, but the validity of the approach can be tested / evaluated before patches to make it fast are iterated on and landed. The final set of patches are per-pass extra instrumentation needed for scenarios where debug-info has special handling (i.e. inlining etc).

[0] https://reviews.llvm.org/D151419


I think a basically mechanical change from pointers to iterators should be NFC and not require testing. That sort of thing happens all the time.

I suspect that the conversion from LDI (legacy debug intrinsics) to NDI (new debug info) may have to suffer through similar conversion efforts as the pass-manager and opaque-pointer changes; over some period, we’ll have to support both, and work through all the existing tests to convert them over. I suspect it will go faster than the previous changes because debug info tests are much less pervasive. Or possibly we won’t have to support both, but at least accept debug intrinsics on IR input and handle them in some reasonable way. (Will need to have that for IR auto-upgrade anyhow.)

One thing that will have to happen, and I know you’ve been wanting to defer it, is to define the textual IR representation of NDI so that it can in fact be tested by something short of producing a final object file and dumping the DWARF/CodeView.

Hi,

The EuroLLVM talk I gave on this topic is up [0], the tl;dr is “let’s capture more information about the intention of optimisation passes moving instructions around, so that we can maintain debug-info automatically”. I’ve now uploaded the core parts of these changes, elaborated below. The net effect is that the position of debug-info records can be preserved as if they were instructions, without them actually being stored as instructions – it’s now time to talk about the /other/ hard part of removing debug intrinsics, which is trading slightly increased memory consumption in normal builds for lower consumption + faster compilation in -g builds. (I’ll talk about that in this topic to avoid fragmenting discussion).

In the patches linked below, we’ve added a new pointer field to class Instruction to reference any debug-info “on” the Instruction’s position. There’s also a new field in BasicBlock to handle scenarios where blocks transiently contain only debug-info [1]. The memory-cost impact of this over builds of an older clang is (a median of) 0.6% more max-RSS for release builds (-O3, no debug-info). There’s broadly no change in memory usage for debug-info builds, although I’m confident we’ll be able to make improvements in time; we haven’t tried yet. In return, there are compile-time improvements for debug-info builds; see further below. To me and my use case, this is an obviously beneficial trade-off: debug-info builds always take longer and use more memory, and developers always end up debugging their code, so that’s always the eventual worst case for all software. Trading more memory on a release build to reduce the worst-case debug-info compile time is great. I imagine not everyone agrees though, so I’d like to draw attention to it and ask… is this alright?

In terms of compile-time costs: right now on the compile-time-tracker [2] there’s a ~2.5% geomean speedup for -g -flto. Some speedups are disproportionately large: tramp3d-v4 is almost 10% faster. The large C++ code-base mentioned in the talk, where we attributed 30% of LTO compile-time to variable locations, speeds up by about 6%. As mentioned elsewhere, this is a naive / simple implementation, and I think in the mid term we can do a lot better than a 2.5% speedup. There’s a whole bunch of other things we could do with debug-info connections no longer being in the Value/Metadata hierarchy, such as:

  • Hard-coding constant-value variable assignments: there’s no need for them to be part of a large metadata use list,
  • Allocating blocks of debug-info records together: they should never be deleted during optimisation, and nothing should ever be inserted in the middle, so they don’t need to be individually re-orderable or freeable,
  • Deleting totally-undef variables earlier: almost one-quarter of variables are totally undef by the end of compilation, and better indexing would let us drop them sooner,
  • Pooling metadata references (maybe, unsure), as debug-info records typically move in packs,

but this is the first step towards exploring that design space.

There’s also a 0.7% slowdown for normal builds: some of this is because we’re paying the branching cost of debug-info maintenance twice; we repeatedly check whether instructions are debug-info or not even when there aren’t any. A previous experiment disabling getNextNonDebugInstruction and debug-info iterators demonstrated a 0.3% speedup for normal builds [3], so there may be more improvements to redeem; we still need to dig through that. Exactly how to stage this into the repo to avoid everyone paying that cost isn’t clear yet.

For the patches themselves, here’s a quick summary of the main five:

Here’s an illustration of the connections between these data structures:

    Instruction>>>>>>>>>>Instruction>>>>>>>>>>Instruction
         |                                         |
      DPMarker                                  DPMarker
       /   \                                     /   \
      /     \                                   /     \
  DPValue  DPValue                          DPValue  DPValue

Instructions (which are now never dbg.values) point at an optional DPMarker, which itself contains a list of DPValues. The markers represent a position in the program, while the DPValues record a Value and details about source variables, much like dbg.values.
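As a boiled-down sketch of those relationships in C++ (the real class definitions in the patches carry more detail and are still changeable):

    class DPValue {              // one variable-location record, like a dbg.value
      DILocalVariable *Variable; // the source variable being described
      DIExpression *Expression;
      Metadata *Location;        // the Value (or constant) it currently has
      DebugLoc Loc;
    };

    class DPMarker {                         // a position between instructions
      Instruction *MarkedInstr;              // the instruction it sits before
      simple_ilist<DPValue> StoredDPValues;  // the records at that position
    };

    class Instruction {
      DPMarker *DbgMarker = nullptr;  // optional; null when there's no debug-info
      // ... existing members ...
    };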

There are a bunch of additional, smaller patches: every place where dbg.value intrinsics are updated needs additional instrumentation. These are pretty straightforward though, and aren’t as important as thinking about data structures.

Patch reviews would be most welcome, or feedback / opinions on the overall direction we’re taking. If we can get some or all of this in-tree and controlled by a cmake flag, it’ll enable other developers to make their own evaluations of the costs and benefits.

Is anyone opposed to these trade-offs? To summarise, according to CTMark there’s a 2.5% reduction in compile time for LTO builds with debug-info, at the cost of a 0.6% increase in max-RSS for release builds. We also remove debug instructions entirely (which will ease maintenance of all optimisation passes), at the cost of maintaining both systems until the work is completed.

[0] 2023 EuroLLVM - What would it take to remove debug intrinsics? - YouTube
[1] This can probably be shuffled somewhere else due to its rare use; something like a DenseMap in Function.
[2] LLVM Compile-Time Tracker
[3] LLVM Compile-Time Tracker (don’t look at the -g runs, it was hard-coded off).


Thanks,
Jeremy


Hi folks,

An update on this – the major plumbing to make this all work is now landing / has landed, with anything that might cause a performance regression guarded by a cmake flag, LLVM_EXPERIMENTAL_DEBUGINFO_ITERATORS. With D154372 landed, LLVM will automagically transform debug-info into a non-instruction representation for the duration of optimisation passes if you pass -mllvm --experimental-debuginfo-iterators=true. This doesn’t achieve very much with what’s in-tree right now though.

The next step is installing a little more plumbing, followed by updating various optimisation passes to support transformation/maintenance of DPValues, the replacement for dbg.value intrinsics. This is going to take the form of parallel implementations of various functions using the new types and objects. On the one hand it feels redundant to just copy+paste a lot of code, but I’d like to view it as a feature: the basic principles behind how LLVM handles debug-info aren’t changing, just some of the types, hence we need duplicate implementations during a transitional period [0].
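As a sketch of what those parallel implementations tend to look like (the accessor and helper names here are illustrative, not the in-tree API):

    // During the transition the same maintenance logic exists once for the
    // intrinsic form and once for the new DPValue records.
    void updateDebugInfoAt(Instruction &I) {
      if (auto *DVI = dyn_cast<DbgValueInst>(&I))
        rewriteDbgValue(DVI);                     // existing intrinsic-based path

      for (DPValue &DPV : I.getAttachedDPValues())  // hypothetical accessor
        rewriteDPValue(DPV);                        // duplicated path, same logic
    }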

Making sure this duplicate code gets tested is important to avoid it immediately rotting. We’ve got a buildbot [1] running with the CMake flag enabled on it, and I’m planning on landing this patch [2] that enables the new-debug-info mode only if it’s built in. That’ll allow us to duplicate some RUN-lines in tests with that flag, meaning we get some test coverage via that buildbot.

Once those patches are all up I’ll post a call-for-testing with instructions on how to test out the debug-info stuff, and what kind of testing is needed. Most of what we’ve been doing is building programs, observing differences in the debug-info output, then letting llvm-reduce narrow down where the difference is for us, so it’s not labour intensive.

@StephenTozer is cooking up some patches for changing the textual IR format [3], more news on that shortly.

[0] Hopefully a few months not years.
[1] Buildbot
[2] [DebugInfo][RemoveDIs] Add flag to use "new" debug-info in opt by jmorse · Pull Request #71937 · llvm/llvm-project · GitHub
[3] [RFC][DebugInfo] Proposed changes to the textual IR representation for debug values


Thanks,
Jeremy
