[RFC] Instruction API changes needed to eliminate debug intrinsics from IR

Hi,

This is a proposal for some changes to LLVM’s instruction movement/insertion APIs to better describe the movement of debug-info. In order to remove debug intrinsics from IR we need to first improve the expressiveness of the instruction-moving APIs. This RFC proposes a way to achieve that, with a couple of remaining question marks.

A while back in Prototyping a not-an-instruction dbg.value I suggested a pathway towards storing variable location debug-info in something that isn’t an intrinsic. The benefits are faster -g compiles and more robust compilation, as the presence of debug intrinsics can interfere with optimisations. We’ve got a prototype at [0] that can build a clang-3.4 binary with no dbg.value intrinsics and produces identical object files [1] – it’s not a complete solution, but we reckon it exposes all of the important design problems. Today I’d like to focus on a single issue: debug intrinsics have instruction iterators (BasicBlock::iterator) which are functionally important to debug-info, and those iterators won’t exist if we cease using intrinsics. We currently consider any instruction to be “a position in the function”, where a debug-info intrinsic is a position in between two “real” instructions. Without debug intrinsics we won’t be able to describe “in between” positions any more; we would have to attach the debug-info to the instructions themselves, which raises further issues. To illustrate, consider this block:

  block:
    dbg.value(...)
    %1 = add...
    dbg.value(...)
    %2 = sub...
    br label %exit

In today’s LLVM, each instruction has an iterator, so we can describe any range of “real” instructions and debug-info instructions to be spliced from one block to another. However, if the debug-info were attached to instructions, we could only have iterators that point at the add, sub, and br. That’s strictly less expressive than what we have today, and this causes meaningful differences in debug-info.

I’ve got two problems to illustrate with examples: the movement of instructions with debug-info attached, and inserting new instructions at the start of blocks.

Instruction::moveBefore can be used to move any instruction from one place to another, without any consideration for debug-info. This isn’t a problem today: if code moves a single instruction then it usually doesn’t want debug-info to move too, while code like moveBBContents [2] just moves every instruction from one place to another, including debug instructions. However, if we attach debug-info to instructions, moveBefore would become responsible for moving the debug-info, and whether it should move or not can be ambiguous. Consider today’s IR:

  bb1:
      dbg.value(...)
      %foo = add %1, %2
      %bar = sub %3, %4

Assume that we can attach the dbg.value information to the add instruction in some way. Now imagine that we called moveBefore, moving the add instruction to another block. If, for example, we are hoisting the add during CSE then the debug-info should not move with the instruction; it should stay in the source block. However, if we’re in a loop like moveBBContents [2] moving all instructions in a block, then it’s necessary for moveBefore to move the debug-info to the destination block. Currently, there’s no way in LLVM to distinguish the two use cases.

The other scenario is inserting one instruction in front of an existing instruction. Should debug-info attached to the existing instruction come before or after the new instruction? We can describe both when debug-info has iterators, but will be forced to pick one if debug-info is attached to instructions. Unfortunately there is no one-size-fits-all solution: in the majority of situations, LLVM today inserts instructions after nearby debug-info intrinsics. However, there are other situations where the opposite is needed: for example in MergeBasicBlockIntoOnlyPred [3], where one block is spliced into another, the predecessor block’s instructions should always come before the successor’s debug-info, because the blocks are effectively concatenated. Today this is unambiguous because BasicBlock::begin and getFirstInsertionPt will return iterators to debug-info intrinsics. However, if we don’t use intrinsics for debug-info, there is no way to describe inserting before the debug-info.
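
To make the distinction concrete, here’s how it is expressed in today’s LLVM, while dbg.values are still instructions (a sketch only; the helper-function names are mine):

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Lands before any leading debug intrinsics: begin() may point at a
  // dbg.value, so the new instruction goes ahead of it -- the behaviour
  // wanted when blocks are effectively concatenated. (Assumes BB is
  // non-empty.)
  void insertAtVeryStart(Instruction *NewInst, BasicBlock *BB) {
    NewInst->insertBefore(&*BB->begin());
  }

  // Lands after any leading debug intrinsics: getFirstNonPHIOrDbg() skips
  // them, which is the more common pattern in LLVM today.
  void insertAfterLeadingDebugIntrinsics(Instruction *NewInst, BasicBlock *BB) {
    NewInst->insertBefore(BB->getFirstNonPHIOrDbg());
  }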

~

There’s a fairly simple solution for the first problem, which is to introduce a specialised moveBefore method that moves debug-info too, and update all the call sites in LLVM that need that behaviour. This is achievable as there are only about twenty call sites that behave in that way. In the prototype linked below the relevant method is called moveBeforePreserving; we renamed moveBefore to moveBeforeBreaking to make the difference explicit. I think it’s useful to have the developer express the “disposition” of the move they’re making: whether it preserves the flow of instructions and thus should transfer debug-info, or whether it breaks the flow.
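
As a sketch of how the two dispositions would look at a call site, using the prototype’s method names (moveBeforePreserving / moveBeforeBreaking – the exact spellings and signatures may well change):

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // CSE-style hoist: "breaking" the flow, so the dbg.value describing the
  // add stays behind in its source block.
  void hoistForCSE(Instruction *Add, Instruction *InsertPt) {
    Add->moveBeforeBreaking(InsertPt);
  }

  // Moving a whole block's contents: "preserving" the flow, so each
  // instruction takes its attached debug-info with it.
  void moveBlockContents(BasicBlock &Src, Instruction *InsertPt) {
    for (Instruction &I : llvm::make_early_inc_range(Src))
      I.moveBeforePreserving(InsertPt);
  }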

However, it’s much harder for the insert-at-block-begin example: it’s a much more common code pattern, with between 100 and 200 call sites where this distinction has an actual effect on variable locations. We’ve explored and implemented a solution: store whether an iterator came from begin / getFirstInsertionPt in a bit inside the BasicBlock::iterator object, aka ilist_iterator [4]. This feels bad because it takes a value type that’s a single pointer right now and adds more data; however, I think it’s inescapable. Iterators are exactly the right design pattern for storing information about positions in a collection. We could:

  • Add the distinction of an iterator being returned by begin or getFirstInsertionPt to the type returned: however, that feels clunky and will produce ugly compile errors,
  • Add a flag parameter to instruction insertion methods, which is unergonomic and error prone,
  • Have the fact that an iterator was intended to be at the start of the block (before debug-info) signalled at runtime, which requires information in the iterator object.

It’s the latter that we’ve used in our prototype – a variety of instruction insertion methods have had overloads added that take iterators, and call sites that use begin or getFirstInsertionPt (or getFirstNonPHI) now pass an iterator for the position to insert instructions at. This is straightforward; for additional safety the instruction-accepting insertion APIs could be removed, forcing everyone to use iterators when inserting. That’s invasive, but IMHO a reasonable price to pay for clearer management of debug-info. We’re using two bits in the prototype: because a pair of instruction iterators describes a half-open range of instructions, we use the extra bits to signal whether the iterators are inclusive of the debug-info attached at the start and end. Coming back to our earlier example:

  bb2:
    dbg.value(...)
    %1 = add...
    dbg.value(...)
    %2 = sub...
    br label %exit

Imagine we have an iterator range from %1 to %2. If we were to splice that range into another block, today LLVM would transfer the add instruction and the following dbg.value. If we attached debug-info to instructions and didn’t have debug intrinsics, that would be the only transfer we could describe. Thus, the two iterator bits signal:

  • For the “first” iterator, whether debug-info attached to the first instruction should be transferred too,
  • For the “last” iterator, whether debug-info attached to the last position should be transferred too,

respectively. That allows us to describe almost all the transfers that could be performed when debug-info is an intrinsic: both dbg.values and the add instruction, or just the add instruction by itself. I think it’s noteworthy that with accessors like begin and getFirstInsertionPt returning iterators with the bits set already, we only found two sites in LLVM where those bits need to be manually updated.
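
As a sketch of what those bits buy us for the example above (the bit-setter names and the splice spelling here are illustrative stand-ins for whatever the prototype actually exposes – the prototype calls the splice blockSplice):

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Reproduce today's behaviour for splicing [%1, %2): move %1 and the
  // dbg.value(B) attached to %2, but leave dbg.value(A), attached to %1,
  // behind in the source block.
  void spliceLikeToday(BasicBlock *DestBB, BasicBlock *SrcBB,
                       Instruction *Add /* %1 */, Instruction *Sub /* %2 */) {
    BasicBlock::iterator First = Add->getIterator();
    BasicBlock::iterator Last = Sub->getIterator();
    First.setHeadBit(false); // hypothetical setter: exclude debug-info before %1
    Last.setHeadBit(true);   // hypothetical setter: include debug-info before %2
    DestBB->splice(DestBB->end(), SrcBB, First, Last);
  }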

In terms of performance: adding the bit to BasicBlock::iterator has a small compile-time cost for no-debug-info builds (0.05%) according to [5]. However, this is going to be paid for by eliminating calls to functions like getNextNonDebugInstruction, which recovers almost 0.3% of compile time [6]. This is definitely not the final word on compile-time performance, but I want to indicate that the costs aren’t going to be overwhelming. The change in memory usage is negligible too, as we typically don’t store iterators in data structures.

As proof that this technique can work, I’d like to present our prototype [0] (based on llvm-15) as evidence – please ignore the implementation details of what we’re doing to blocks, instructions and storage of debug-info, those are all very changeable. My major concern is how the instruction API changes, how this fans out into the rest of the compiler, and whether it’s a tolerable burden for other compiler developers. Thus, I think it’s noteworthy that we’re “only” touching 200 files, and the vast majority of the changes just alter the spelling of moveBefore or pass an iterator into a splice/insertion method.

There’s an additional problem with eliminating debug intrinsics (and their iterators): blocks can be temporarily empty (no terminator) but still contain debug-info, which should then go in front of any terminator added later. Our solution to that so far is to special-case the insertion of a terminator into a block, which is sufficient to fix up any out-of-place debug-info.

Feedback is most welcome, specifically on the two changes that would be needed to the instruction API:

  • The need to use a moveBefore method that explicitly states the intention of the developer wrt. whether debug-info should be moved too.
  • Sticking some extra bits in BasicBlock::iterator to signal at runtime whether a position in a block comes before or after the nearby debug-info.

To provide context, the other things we’re working on are the storage model for variable-location information, and what kind of maintenance would need to be applied during optimisations. We haven’t thought about representing these things in bitcode / textual IR and I’d prefer to avoid thinking about it for now.

Cheers to @StephenTozer for ploughing through a lot of this. CC the usual debug-info folks: @adrian.prantl @dblaikie @rnk @pogo59 @cmtice @OCHyams @jryans @slinder1 , although I think this is probably relevant to the interests of all pass-authors.

[0] Publish our prototype "killing debug-intrinsics" diff by jmorse · Pull Request #1 · jmorse/llvm-project · GitHub
[1] CommandLine.cpp in clang-3.4 bakes the build time into itself, all the other files are identical.
[2] llvm-project/IROutliner.cpp at c42eda5d3692fcd67fc3d043ab37f1950ea653b9 · llvm/llvm-project · GitHub
[3] llvm-project/Local.cpp at a33f018b89c07e0728539b34c158e88a7db49982 · llvm/llvm-project · GitHub
[4] Cue gasps of horror from the audience!
[5] LLVM Compile-Time Tracker
[6] LLVM Compile-Time Tracker NOTE: the -g modes are not meaningful as all debug-info is dropped.


This sounds great, thank you for the write-up!

To recap (so you can tell me if I’m missing something):

  • The current API essentially has this distinction between “preserving”/“breaking” already, it just is not explicitly recorded anywhere because the debug markers are actually Instructions, and so the behavior we want essentially falls out: you generally move all instructions when you are “preserving” and a single instruction when you are “breaking”.
    • An example where debug-info should move with instructions is e.g. inlining. When we move many instructions at once, we naturally choose to bring along the interspersed debug-info instructions. This corresponds to the “preserving” disposition.
    • An example where debug-info should not move with instructions is e.g. CSE. The debug-info describing when e.g. an assignment occurs doesn’t want to be moved relative to the rest of the instructions just because CSE hoisted one of them. This corresponds to the “breaking” disposition.

Assuming I’m following so far, the choice to move the high-level disposition into the iterator seems reasonable to me. If there is not a significant runtime performance hit, then I agree that doing this in the C++ type system doesn’t sound appealing.

I think I do get a little lost when trying to understand the piece about begin/getFirstInsertionPt. In the example of:

bb2:
    dbg.value(A, ...)
    %1 = add...
    dbg.value(B, ...)
    %2 = sub...
    br label %exit

Does this translate to something like:

bb2:
    %1 = add... !dbg.attachments {dbg.value(A, ...)}
    %2 = sub... !dbg.attachments {dbg.value(B, ...)}
    br label %exit

?

And so the debug-info “of” a particular instruction takes effect before it?

If so, then I don’t understand the example of MergeBasicBlockIntoOnlyPred. It seems like it would naturally fall out that DestBB->splice(DestBB->begin(), PredBB) would always result in the whole PredBB including its debug info coming before the first instruction and first debug info of DestBB. Maybe that’s your point, as it is using begin, but then I don’t understand in what situation another behavior is possible.

I admittedly have not spent the time to look at the code yet, so sorry if this is an obvious question!

The current API essentially has this distinction between “preserving”/“breaking” already, it just is not explicitly recorded anywhere because the debug markers are actually Instructions, and so the behavior we want essentially falls out:

Yup, that’s exactly it, and the two examples are correct too.

Does this translate to something like […] and so the debug-info “of” a particular instruction takes effect before it?

That’s correct – I’ve been trying to avoid any textual representation because that’s a discussion I want to avoid, but what you wrote is a fair illustration of the information that would be stored.

If so, then I don’t understand the example of MergeBasicBlockIntoOnlyPred. It seems like it would naturally fall out that DestBB->splice(DestBB->begin(), PredBB) would always result in the whole PredBB including its debug info coming before the first instruction and first debug info of DestBB. Maybe that’s your point, as it is using begin, but then I don’t understand in what situation another behavior is possible.

Taking a quick survey of the calls to “blockSplice” in the prototype code, half of the call sites insert at either begin() or end(), while the other half insert at a specific instruction. There are a few examples in InlineFunction.cpp where groups of allocas are shuffled around, or single blocks are directly inlined into the call-site. In those scenarios, the splice function needs to insert instructions after any attached debug-info. For example, if we had:

bb1:
  %2 = add i32 %1, %0
  call void @foo(%2), !dbg.attachments {dbg.value(A,...)}
  br label %bar

And inlined the contents of foo into the block at the position of the call instruction, the attachments should be moved onto the first instruction inlined so that they’re visible before the call site. If that didn’t happen, they would sink down to the branch instruction when the call was deleted, and this wouldn’t accurately represent the source program. Hence, there’s a need for the splice function to know whether the caller wants the moving instructions to come before or after the debug-info at the insertion point.

That being said: there are only about 15 sites where the blockSplice method is actually needed, so it wouldn’t be a huge burden to expect callers to express this information as an argument, instead of as information baked into the iterator. The larger problem is the several hundred call sites where instructions are inserted at the start of a block with something like moveBefore, and the pass author has made a decision about whether the instruction should come before or after debug intrinsics by calling begin/getFirstInsertionPt for inserting before, or getFirstNonPHIOrDbg for inserting after. While we have debug intrinsics with iterators, that distinction is communicated to the inserting function by the position being a debug-instruction (or not); in the prototype it’s communicated by the bits in the iterator object.
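
A sketch of how that looks at a call site in the prototype world (assuming the iterator-taking insertBefore overload proposed here; exact signatures may differ):

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Inserting "before" the block's leading debug-info: the iterator from
  // getFirstInsertionPt() carries the bit saying so.
  void insertBeforeBlockDbgInfo(Instruction *NewInst, BasicBlock *BB) {
    NewInst->insertBefore(BB->getFirstInsertionPt());
  }

  // Inserting "after" the leading debug-info: an iterator taken straight
  // from the first real instruction doesn't have the bit set.
  void insertAfterBlockDbgInfo(Instruction *NewInst, BasicBlock *BB) {
    NewInst->insertBefore(BB->getFirstNonPHIOrDbg()->getIterator());
  }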

These decisions (insert at the start of the block, before or after debug-info?) have to be preserved to get an identical binary; however, it’s not clear whether they’re necessary for preserving the quality of debug-info. Perhaps it doesn’t matter whether a new instruction generated by SCEV goes before or after variable assignments; on the other hand, if we sink an instruction from a parent to a successor block then we might care how it’s ordered with respect to the variable assignments in the successor. Baking this information into the iterator avoids having to answer those questions, which may seem like an evasion, but on the other hand it guarantees no regressions.

Ah, that makes sense, thank you! I think I just needed to spend a bit more time to get to the cases you were referring to, but took the lazy way out and just asked. Thank you for humoring me :)

I definitely see the benefit of retaining the information implied at the creation of the iterator. It seems the most elegant and least error-prone option.

I really like the approach of starting from true “NFC”, and I’m a bit shocked you were able to achieve it! I also particularly enjoy the “inhale”/“exhale” approach around serialization to defer updating tests. I will have to spend a bit more time reading the code, but I’m fairly convinced of the approach.

Thanks Scott,

Here’s an outline of how we’d like to move this forwards; we can put a lot of non-invasive changes in tree with zero impact on the rest of LLVM, which makes development easier for us. We’re continuing to iterate on the data structures used for non-instruction debug-info; it’s at the point where -g compiles are a little faster (while still doing a lot of redundant work). More news on that in the future.

  1. We can fix a variety of debug-info-affects-codegen faults we’ve found in this process,
  2. Add iterator-accepting insertion handlers and utilities; for the most part this is only changing whether an iterator is dereferenced or getIterator is called on an instruction,
  3. Channel all block splicing through a single function (this has largely been done on main by others already),
  4. Use a specialised moveBefore method with the “moving debug-info” disposition at about 20 locations in LLVM.

That’ll reduce the size of the delta we have from ‘main’ to roughly 30 files, which is much easier to manage. Once we’ve got a Good™ in-memory representation, we’ll make another proposal about that and:

  1. Add the relevant files and plumbing to the llvm/lib/IR directory,
  2. Add parallel implementations of the 15 to 20 places that directly update dbg.values today,
  3. Add the relevant bits to BasicBlock::iterator as mentioned above.

This will allow for more equivalence testing and experimentation by other developers, plus evaluation of the compile-time benefits. The final steps would be working out how to represent this information in textual IR and bitcode, plus updating the 1000 tests containing dbg.values.


Hello folks – FYI, I gave a presentation at EuroLLVM illustrating the ideas behind the proposal above and exploring where the costs are; I’ll link to it from here when the videos are up.

I’ve started uploading our patches to Phabricator, forming a stack based on [0]. If you look at the five patches based on that, you’ll see the extent of the invasiveness of these changes: basically turning a bunch of things that were instruction pointers into iterators. I don’t think this is especially remarkable as it’s effectively a spelling change, although you might say that the iterator form is slightly more verbose.

Technically these changes are NFC (aside from D151419 itself) – however they won’t be NFC in the context of a debug-intrinsic-free LLVM, because they will be transmitting more information about the debug-info context through the BasicBlock::iterator object once that happens. This raises the annoying question of how we go about testing these changes. So far I’ve been comparing the output of a compile with no intrinsics against one that does use intrinsics, to determine differences from the “correct” output; however, this isn’t a long-term strategy as we want to get rid of intrinsics entirely. It’s also infeasible to write a test for each changed code path because there are going to be hundreds of them.

If we’re willing to accept that with these instruction API changes the only way to insert instructions is going to be with an iterator, then the changes I’ve linked to don’t need to be tested: they’re just the first tranche of changes needed in replacing one set of APIs with another. That’s my preferred direction of travel / interpretation – no-one has yet objected to the idea of requiring insertion with a BasicBlock::iterator, or the need to store an extra bit in the iterator class. If you have an objection to this idea, please do speak up now!

Testing for the correct placement of variable location information is an open problem as we don’t have comprehensive testing of debug-info in the context of every optimisation transformation.

Going forwards: the next set of patches to put in that stack will be for a naive implementation of storing variable assignment information outside of instructions. I’d prefer to avoid premature optimisation at this stage – we’re still optimising it, but the validity of the approach can be tested / evaluated before patches to make it fast are iterated on and landed. The final set of patches is the per-pass extra instrumentation needed for scenarios where debug-info has special handling (e.g. inlining).

[0] https://reviews.llvm.org/D151419


I think a basically mechanical change from pointers to iterators should be NFC and not require testing. That sort of thing happens all the time.

I suspect that the conversion from LDI (legacy debug intrinsics) to NDI (new debug info) may have to suffer through similar conversion efforts as the pass-manager and opaque-pointer changes; over some period, we’ll have to support both, and work through all the existing tests to convert them over. I suspect it will go faster than the previous changes because debug info tests are much less pervasive. Or possibly we won’t have to support both, but at least accept debug intrinsics on IR input and handle them in some reasonable way. (Will need to have that for IR auto-upgrade anyhow.)

One thing that will have to happen, and I know you’ve been wanting to defer it, is to define the textual IR representation of NDI so that it can in fact be tested by something short of producing a final object file and dumping the DWARF/CodeView.

Hi,

The EuroLLVM talk I gave on this topic is up [0]; the tl;dr is “let’s capture more information about the intention of optimisation passes moving instructions around, so that we can maintain debug-info automatically”. I’ve now uploaded the core parts of these changes, elaborated below. The net effect is that the position of debug-info records can be preserved as if they were instructions, without them actually being stored as instructions – it’s now time to talk about the /other/ hard part of removing debug intrinsics, which is trading slightly increased memory consumption in normal builds for lower consumption + faster compilation in -g builds. (I’ll talk about that in this topic to avoid fragmenting discussion).

In the patches linked below, we’ve added a new pointer field to class Instruction to reference any debug-info “on” the Instruction’s position. There’s also a new field in BasicBlock to handle scenarios where blocks transiently contain only debug-info [1]. The memory-cost impact of this over builds of an older clang is (a median of) 0.6% more max-RSS for release builds (-O3, no debug-info). There’s broadly no change in memory usage for debug-info builds, although I’m confident we’ll be able to make improvements in time; we haven’t tried yet. In return, there are compile-time improvements for debug-info builds, see further below. For me and my use case, this is an obviously beneficial trade-off: debug-info builds always take longer and use more memory, and developers always end up debugging their code, so that’s always the eventual worst case for all software. Trading more memory on a release build to reduce the worst-case debug-info compile time is great. I imagine not everyone agrees though, so I’d like to draw attention to it and ask… is this alright?

In terms of compile time costs: right now on the compile-time-tracker [2] there’s a ~2.5% geomean speedup for -g -flto. Some speedups are disproportionately large; tramp3d-v4 is almost 10% faster. The large C++ code-base mentioned in the talk, where we attributed 30% of LTO compile-time to variable locations, speeds up about 6%. As mentioned elsewhere, this is a naive / simple implementation, and I think in the mid term we can do a lot better than a 2.5% speedup. There’s a whole bunch of other things we could do now that debug-info connections aren’t in the Value/Metadata hierarchy, such as:

  • Hard-coding constant-value variable assignments: there’s no need for them to be part of a large metadata use list,
  • Allocate blocks of debug-info records together: they should never be deleted during optimisation, and nothing should ever be inserted in the middle, so they don’t need to be individually re-orderable or freeable,
  • Almost one-quarter of variables are totally undef by the end of compilation; if we could index them better, we could delete them earlier.
  • Pooling metadata references (maybe, unsure) as typically debug-info records move in packs.
But this is the first step towards exploring that design space.

There’s also a 0.7% slowdown for normal builds: some of this is because we’re paying the branching cost of debug-info maintenance twice; we repeatedly check whether instructions are debug-info or not even when there aren’t any. A previous experiment disabling getNextNonDebugInstruction and debug-info iterators demonstrated a 0.3% speedup for normal builds [3], so there may be more improvements to redeem; we still need to dig through that. Exactly how to stage this into the repo to avoid everyone paying that cost isn’t clear yet.

For the patches themselves, here’s a quick summary of the main five:

Here’s an illustration of the connections between these data structures:

    Instruction>>>>>>>>>>Instruction>>>>>>>>>>Instruction
         |                                         |
      DPMarker                                  DPMarker
       /   \                                     /   \
      /     \                                   /     \
  DPValue  DPValue                          DPValue  DPValue

Instructions (which are now never dbg.values) point at an optional DPMarker, which itself contains a list of DPValues. The markers represent a position in the program, while the DPValues record a Value and details about source variables, much like dbg.values.
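
To convey the shape of the design, here’s a much-simplified sketch of those classes; the real definitions (and field names) are in the patches, so treat this as illustrative only:

  #include "llvm/ADT/ilist_node.h"
  #include "llvm/ADT/simple_ilist.h"
  #include "llvm/IR/DebugInfoMetadata.h"
  #include "llvm/IR/DebugLoc.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/IR/Metadata.h"

  class DPMarker;

  // Roughly one dbg.value's worth of information, minus the intrinsic call.
  class DPValue : public llvm::ilist_node<DPValue> {
  public:
    DPMarker *Marker;                // the position this record hangs off
    llvm::Metadata *LocationOp;      // the Value / constant being described
    llvm::DILocalVariable *Variable; // which source variable it describes
    llvm::DIExpression *Expression;  // how the variable is computed from it
    llvm::DebugLoc DbgLoc;           // scope and inlining information
  };

  // One per instruction that has debug-info attached "before" it.
  class DPMarker {
  public:
    llvm::Instruction *MarkedInstr;             // the instruction whose position this marks
    llvm::simple_ilist<DPValue> StoredDPValues; // the records at that position
  };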

There are a bunch of additional, small(er) patches: every place that dbg.value intrinsics are updated needs additional instrumentation. These are pretty straightforward though, and aren’t as important as thinking about data structures.

Patch reviews would be most welcome, or feedback / opinions on the overall direction we’re taking. If we can get some or all of this in-tree and controlled by a cmake flag, it’ll enable other developers to make their own evaluations of the costs and benefits.

Is anyone opposed to these trade-offs? To summarise, according to CTMark there’s a 2.5% reduction in compile time for LTO debug-info builds at the cost of a 0.6% increase in max-RSS for release builds. We also remove debug instructions entirely (which will ease maintenance of all optimisation passes) for the cost of maintaining both systems until the work is completed.

[0] 2023 EuroLLVM - What would it take to remove debug intrinsics? - YouTube
[1] This can probably be shuffled somewhere else due to its rare use, something like a DenseMap in Function.
[2] LLVM Compile-Time Tracker
[3] LLVM Compile-Time Tracker (don’t look at the -g runs, it was hard-coded off).


Thanks,
Jeremy


Hi folks,

An update on this – the major plumbing to make this all work is now landing / has landed, with anything that might cause a performance regression guarded by a cmake flag, LLVM_EXPERIMENTAL_DEBUGINFO_ITERATORS. With D154372 landed, LLVM will automagically transform debug-info into a non-instruction representation for the duration of optimisation passes if you pass -mllvm --experimental-debuginfo-iterators=true. This doesn’t achieve very much with what’s in-tree right now though.

The next step is installing a little more plumbing, followed by updating various optimisation passes to support transformation/maintenance of DPValues, the replacement for dbg.value intrinsics. This is going to take the form of parallel implementations of various functions using the new types and objects. On the one hand it feels redundant to just copy+paste a lot of code, but I’d like to view it as a feature: the basic principles behind how LLVM handles debug-info aren’t changing, just some of the types; hence we need duplicate implementations during a transitional period [0].

Making sure this duplicate code gets tested is important to avoid it immediately rotting. We’ve got a buildbot [1] running with the CMake flag enabled on it, and I’m planning on landing this patch [2] that enables the new-debug-info mode only if it’s built in. That’ll allow us to duplicate some RUN-lines in tests with that flag, meaning we get some test coverage via that buildbot.

Once those patches are all up I’ll post a call-for-testing with instructions on how to test out the debug-info stuff, and what kind of testing is needed. Most of what we’ve been doing is building programs, observing differences in the debug-info output, then letting llvm-reduce narrow down where the difference is for us, so it’s not labour intensive.

@StephenTozer is cooking up some patches for changing the textual IR format [3], more news on that shortly.

[0] Hopefully a few months not years.
[1] Buildbot
[2] [DebugInfo][RemoveDIs] Add flag to use "new" debug-info in opt by jmorse · Pull Request #71937 · llvm/llvm-project · GitHub
[3] [RFC][DebugInfo] Proposed changes to the textual IR representation for debug values


Thanks,
Jeremy


Hi all,

We’ve reached a point where what’s in-tree can build an identical(*) stage2 binary of clang, so we’ll be posting a call-for-testing shortly with instructions. It should work without incident and produce debug-info for normal targets using SelectionDAG. Note that for proving equivalence between dbg.value and “RemoveDIs” mode it relies on a set of “equivalence” patches at [0]; more details on why below. Rather than bog the call-for-testing thread down in technical details, I figured I’d document here what the current limitations / challenges / workplan look like. The summary is:

  • What we’re hoping to ship and when,
  • Pitfalls and gotchas for developers with these changes,
  • Remaining unfinished work,
  • Work we’ve taken shortcuts on,
  • Performance and future performance-improvements.

Shipping

In a perfect world I’d like to turn RemoveDIs/DDD on “in-memory” in time for LLVM18, although that might be tight due to Christmas holidays etc coming up. We don’t yet have production-ready patches for textual-IR and bitcode-IR format changes. However, given the performance improvement (see later) and that we can move back-and-forth between the two debug-info formats at will, we could have LLVM operate in-memory without debug intrinsics, and convert back to intrinsics once the pass manager completes. This is how all the lit tests we’ve instrumented are operating.

I think the likelihood of this happening will be governed by the response to the testing call; please test and raise objections if there’s trouble.

Pitfalls

This work hinges on putting a bit in BasicBlock::iterator to signal some debug-info significance, which raises a couple of scenarios where it’s easy to make mistakes. For example, dereferencing an iterator to the start of a block and inserting there using the Instruction *, instead of passing in the iterator for insertion:

BasicBlock::iterator It = BB->getFirstInsertionPt();
Inst->insertBefore(&*It);

risks causing debug-info to attach to the “wrong” instruction, i.e. it won’t be in the same place as in dbg.value mode (a corrected form is sketched after the list below). There are other variations on accidentally dropping the bit stored in the iterator that are easy to write. While this is annoying, I’d like to make two observations:

  • This happens strictly less often than you have to check “is this a debug-info instruction?” in LLVM today, and
  • In my current experience it usually leads to debug-info records attaching to PHI instructions, which is something we can detect and assert on.

Thus the trade-off is a frequent, hard-to-detect problem that silently breaks your optimisations when you don’t expect it (with -g), versus a less frequent problem that stands a chance of being statically analysed and has a reasonably high probability of causing assertion / verifier failures when you get it wrong. Plus, we can remove any instruction-ptr-taking insertion APIs to encourage the correct use of iterators.
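
For reference, here’s the corrected form of the snippet above, as a sketch (assuming the iterator-taking insertBefore overload from this work):

  BasicBlock::iterator It = BB->getFirstInsertionPt();
  Inst->insertBefore(It); // passing the iterator keeps the positioning bit,
                          // so debug-info stays where dbg.value mode puts it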

Remaining work

Here’s a list of things that don’t work yet, why, and what we’re doing about them:

  • dbg.assign: this needed an extra replaceable metadata reference to be tracked, which we’ve been deferring. Currently prototyping it.
  • dbg.declare and dbg.label: these are much simpler to address, and we wanted to deal with the hard problems first, so have ignored them so far. We’ve got patches up for review that move dbg.declares out of the instruction stream, working on dbg.label.
  • Not all of the lit tests pass, because we’ve been focusing on having identical binary outputs and there are a number of tests that examine things like -debug output. That needs cleanup; we don’t anticipate any fundamental issues though.
  • As mentioned above we’ve got textual IR and bitcode prototypes, but they’ll necessarily have a good amount of time being reviewed and examined, so they’re on the backburner right now.

Shortcuts

There are a number of things we simply haven’t done, not because they present fundamental issues but because the fixes are annoying. It’s also why what’s in-tree right now won’t produce exactly identical results yet: to get an identical binary we have to track what current LLVM does bug-for-bug, which we’d rather not commit and write patches for. Here’s a brief summary, but there’s more information in the commit messages in the patches at [0]:

  • Updating dbg.values when sinking instructions from instcombine: this wouldn’t need to happen if the codegen backends for all architectures used instruction referencing, which is designed to support expressing use-before-defs in debug-info. We’ve got a patch, but it needs productising.
  • Loop-strength-reduction’s salvaging of dbg.values into variadic expressions: the code is tightly coupled with the storage model for dbg.values, and we don’t want to write templates, so have ignored it for now.
  • SimplifyCFGOpt::hoistCommonCodeFromSuccessors tries to zip sequences of dbg.values together, in a way that feels fragile to me and relies heavily on there being a total-order on dbg.values and instructions. This is fiddly to re-implement, so we haven’t done it yet.
  • GlobalISel: we’re not familiar with gisel, so haven’t instrumented it. The prototype [0] we have just converts back to dbg.value form at the start of gisel, so this isn’t a blocker for testing.
  • There are numerous hunks that fix debug-info-affects-codegen issues, which we don’t really want to write a lot of tests for because we’re trying to design them out of LLVM.

Performance and future improvements

Here’s a compile-time-tracker link with the performance outcomes of turning on “RemoveDIs” mode [1], showing about a 2% speedup overall. Note that this comparison features a few commits that aren’t in-tree yet, and the marginal slowdown in non-debug-info builds can be paid for by getting rid of getNextNonDebugInstruction (see post 8 in this thread). It’s an alright start – but I think we can probably do much better; see some of the ideas in post 8 of this thread again. Removing metadata connections for constant-valued and undef debug-info records seems especially likely to increase efficiency, as it’s going to reduce the size of use-lists/use-maps.

We believe the largest performance benefits are proportionate to inlining: heavily inlined programs increase the number of debug-info records massively.

One of the things discovered during this exploration is that there’s a large amount of DWARF output that depends on the ordering of debug intrinsics to provide stability. For example we order .debug_loclists by the order that we discover variables in. Having to preserve strict ordering during compilation is a pain, and if we can eliminate that we’ll likely unlock even more improvements.

Many thanks to my colleagues @OCHyams and @StephenTozer who’ve been writing and reviewing patches on this topic for a long time.

[0] GitHub - jmorse/llvm-project at removedis-comparison-patches currently a70773d05cec55
[1] LLVM Compile-Time Tracker


Hi all,

I’ve put an introduction to this migration up here [0]. (I’ve no idea where the excess vspace comes from after the opening paragraph; I’m not a web developer.)

We’re rapidly approaching the point where we can turn this feature on (i.e., possibly in the next couple of days) – there are a few remaining snags to address, some performance loss in non-debug builds that we can reduce, and so forth. I thought I’d outline how we see things landing and progressing:

  1. Soon, we’ll enable LLVM_EXPERIMENTAL_DEBUGINFO_ITERATORS and the --experimental-debuginfo-iterators flag by default, then wait for bug reports in case we need to do a quick revert. This will cause debug intrinsics [1] to be converted to “new” debug-info when the IR enters the pass manager or gets loaded from bitcode, and converted back to intrinsics when it’s written out. This will all happen “in-memory”.
  2. Assuming there’s no major turbulence from that, after a couple of weeks we’ll delete some shims we have in place to check/ensure output stability: some attributes re-order as a result of RemoveDIs, and we don’t want to generate a lot of test updates while there’s still a risk of RemoveDIs getting turned off quickly.
  3. We’ll then land changes to the bitcode IR format and the textual IR format, providing some scripts and switches to ease downstream adoption.

The aim of this plan is to hopefully get LLVM operating in memory without using intrinsics ASAP, converting back to intrinsics at the terminals, so that we can get this feature turned on and tested without burdening anyone with immediate test maintenance. We can then spread the transition to a new output format over the course of LLVM19, instead of trying to make everything happen at once.

[0] https://llvm.org/docs/RemoveDIsDebugInfo.html
[1] Except dbg.label, that’s still coming.


Thanks,
Jeremy


Hi,

We’re now at the cusp of turning this on. When that switch flips, it’s going to cause LLVM to use non-instruction debug-info /only in the context of the pass manager/. That means all optimisation passes will be exposed to this, and we believe we have them all instrumented to act identically to how they did before. However, the rest of LLVM will act as before, printing intrinsics in the output and in bitcode.

We had a couple of abortive attempts to draw a line through LLVM so that we could stage these changes incrementally, and this is what we’ve come up with. The other paths we’ve explored all involve a “big bang” where a lot of the edges of the compiler change and cause friction. If this works for everyone in the pass manager, we’ll be able to incrementally push outwards from there and loop in other parts of LLVM. Plus, this is allowing us to avoid changing the textual IR for the moment, which is going to be a chore.

Flipping this switch is going to expose downstream tests to RemoveDIs, and I imagine people will encounter some variation in their tests, so I thought I’d summarise the kinds of variation we’re expecting to see. These come in the forms of “inconsequential output variation”, “your downstream code might need updating”, and “definitely a bug”.

You can keep your merge going by pinning tests to intrinsic-mode temporarily. Add “--experimental-debuginfo-iterators=false” to your RUN lines. You can also open github issues and link them to [DebugInfo][RemoveDIs] Umbrella ticket for problems with "new" non-instr debug-info · Issue #74735 · llvm/llvm-project · GitHub, and we can give you a hand.

The inconsequential variations we’d expect to see:

  • Attributes attached to the declarations of debugging intrinsics may change, because the declarations aren’t present during the optimisation pipeline. There’s no functional consequence to this change.
  • Debug intrinsic calls in the output may grow the “tail” flag. The flag is meaningless for debug intrinsics and is usually attached eagerly during optimisation. We’ve canonicalised the output to always be marked “tail” now as it breaks the fewest tests, with CHECK lines that don’t expect “tail” safely ignoring it.
  • Some tests examine the output of passes printed with the “-debug” flag, and won’t show any intrinsics as they’re not present any more. Unless it’s a debug-info pass, variation here is fine.
  • Numerous issues where debug-info affects the code generated by LLVM are going to fix themselves: if you see codegen changes, check whether your test accidentally relies on such a bug by passing -strip-debug to opt, to obtain the correct code to be generated.
  • Source locations from debug-intrinsics often leak into “real” instructions, something that won’t happen any more. This can cause metadata node-numbers to re-order: if your test relies on hardcoded numbers, use FileCheck variable captures instead.

Issues with downstream code might cause some variation too – there’s some documentation about what’s going on here: What’s all this then? — LLVM 19.0.0git documentation.

  • The most likely cause of any meaningful test variation is using instruction pointers to express an insertion position: you now need to use a BasicBlock::iterator instead. This can present itself as dbg.value intrinsics shifting around.
  • Assertions firing saying that PHI nodes are not grouped at the start of a block are caused by the above bullet point too. Not using an iterator can cause PHIs to be inserted after debug-info records instead of before them, which then causes issues when the IR is converted back to intrinsic form.
  • Assertions about “IsNewDbgInfoFormat” having a mismatched value indicate your code has ended up mixing intrinsic-form and non-intrinsic-form debug-info. If you’re creating and using Functions and BasicBlocks before inserting them into a parent, they’ll default to “IsNewDbgInfoFormat=false” (a sketch of one way to handle this follows this list).
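
Here’s a minimal sketch of keeping the flag consistent when IR is built before being inserted into its parent; the flag and setter names are as used in the RemoveDIs patches at the time of writing and may change:

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  Function *createInModule(FunctionType *FnTy, Module &M) {
    // Created without a parent, so it defaults to the old (intrinsic) format.
    Function *F = Function::Create(FnTy, GlobalValue::ExternalLinkage, "f");
    // Adopt the destination module's debug-info format before inserting.
    F->setIsNewDbgInfoFormat(M.IsNewDbgInfoFormat);
    M.getFunctionList().push_back(F);
    return F;
  }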

Any crashes aside from the two assertion-failures mentioned above mean something is definitely wrong. Any scenario where there are too many debug intrinsics printed in the output, or too few, means something’s very wrong too. We’ve tried to reduce the test-changes involved in applying the switch-throwing patch, so it should be an easy revert if it becomes necessary.


We have downstream code that creates new functions with Function::Create(Ty, Linkage, AddrSpace, N, M) which immediately inserts them into module M. We have found that they still default to IsNewDbgInfoFormat=false. Is there a better way of creating-and-inserting them, so that they automatically pick up the flag from the parent module?

new functions with Function::Create(Ty, Linkage, AddrSpace, N, M) which immediately inserts them into module M. We have found that they still default to IsNewDbgInfoFormat=false.

Hmmm – it looks like that Function constructor is directly inserting into the module’s function list and skipping the flag setting; it should be fixed by [DebugInfo][RemoveDIs] Set new-dbg-info flag from Modules correctly by jmorse · Pull Request #82373 · llvm/llvm-project · GitHub.

Alas, I don’t believe there’s an easier way of setting that flag, as its purpose is to detect situations where old-debug-info is being created and then inserted into a new-debug-info context. That requires using the Module flag as a source-of-truth.

The flag is just a development aid that helped us find problems while implementing the RemoveDIs work – if it’s now producing noise instead, we could probably downgrade it to an optional warning / debug output. We should be getting rid of it entirely within a month anyway.

While I get the point of this automatic conversion between old and new debug formats, it has the unfortunate consequence that, in a downstream project that breaks because of not setting the new-dbg-info flag somewhere, adding -print-before-all -print-after-all becomes ineffective for debugging: it causes the debug info to be converted, and the assertion failure that I am trying to fix no longer shows up. Per comment 11, this situation is intended to last for a few months. Is there an option to skip this conversion? I am not seeing one; if there is not one, could one be added?

The error I am getting seems to be due to Module::getOrInsertFunction creating a function outside of any module, implicitly using old debug info format, and then inserting it into the module. When Function’s constructor taking a parent module is used, the debug info format is now inherited from the module. Should Module::getOrInsertFunction do the same, or is it the responsibility of its users to set the debug info format?

Another question, based on “That means when writing passes that insert LLVM IR instructions, you need to identify positions with BasicBlock::iterator rather than just a bare Instruction *”: what is the intended update for code that calls e.g. BinaryOperator::Create*, which take only an Instruction * and have no overload that takes a BasicBlock::iterator?

It seems pretty obvious to me that Module::getOrInsertFunction should copy the flag from the module to the new function, just like Function::Create does now.

Thanks, that seemed like the most sensible thing to me too, created Module::getOrInsertFunction: set debug info format by hvdijk · Pull Request #82505 · llvm/llvm-project · GitHub for it.

Hi,

That should be fixable (thanks for the PR!); it’s beginning to sound like this assertion is providing more noise than signal though. If the patch doesn’t fix it, or if anyone else downstream is running into difficulties, we’ll downgrade it to some kind of warning or optional development aid. (As mentioned elsewhere, it used to be useful, but it doesn’t look like it is any more).

Another question, based on [link] “That means when writing passes that insert LLVM IR instructions, you need to identify positions with BasicBlock::iterator rather than just a bare Instruction *”: what is the intended update for code that calls e.g. BinaryOperator::Create*, which take only an Instruction * and have no overload that takes a BasicBlock::iterator?

Good question – ultimately these should become methods that take iterators; we’ll be sweeping the codebase to provide both shortly, and eventually only iterators. For the time being, code that inserts with an Instruction * runs the risk, in some rare circumstances, of changing the position at which variable locations appear at the start of a block. Those risks will go away when insertion is via iterators.

That’s actually the summit of the whole piece of work here: carrying a bit of information via the iterators means all the debug-info accounting can happen in the background. It’s what’s going to free us all from skipDebugIntrinsics and similar debug-info mangling during analyses!

It didn’t fix all instances of the error, but the next one I encountered is a case where the error is useful, as it highlights a real problem: building one module in memory, implicitly using the new debug info format, loading another module from disk, implicitly using the old debug info format, and copying functions over from one to the other. That would go wrong if the error were just ignored, so I appreciate the error.

Possibly something else in LLVM should be changed there, it seems unfortunate that it’s possible to implicitly mix debug info formats; I would have expected that any code that does not touch debug info format either consistently uses the old format, or the new format.

It’s not just subtle, mostly harmless debug info that changes: in my case it also resulted in a PHI being inserted in the wrong place, meaning debug info appeared before PHIs, which LLVM then hit an assertion failure on.

If anyone else hits this before the new API changes go in: using the Create variants that do not insert the instruction, and then manually inserting the instruction afterwards, works fine.
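
For example, a minimal sketch of that workaround (assuming the iterator-taking insertBefore overload from the RemoveDIs work; the helper name here is hypothetical):

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  Instruction *createAddAtBlockStart(Value *LHS, Value *RHS, BasicBlock *BB) {
    // Create without inserting...
    Instruction *Add = BinaryOperator::Create(Instruction::Add, LHS, RHS);
    // ...then insert through an iterator, so the debug-info positioning bit
    // from getFirstInsertionPt() is carried along.
    Add->insertBefore(BB->getFirstInsertionPt());
    return Add;
  }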