[RFC] Rewriting the instruction insertion interface

StephenTozer · April 22, 2024, 3:02pm

Summary

For the new debug info format, it has become necessary to explicitly distinguish between different patterns for inserting instructions to prevent incorrect debug info generation. We (Sony) propose a change to the API for moving/inserting instructions to make the correct usage intuitive and easy to verify without requiring any special knowledge of the debug info format.

Background

The RemoveDIs[0] project is nearing its next major milestone of being on-by-default within all tools and in LLVM’s default output. As we move towards wider use of debug records and eventually the removal of debug intrinsics, we’re focusing on updating the way that we handle moving and inserting instructions, to reflect the changes to iterators introduced by RemoveDIs: debug instructions are no longer used, but debug records exist in-between instructions, and it is necessary at times to distinguish between inserting before or after those records.

----> I1 ----------> I2 ----
         <Records>
        ^         ^
   Before         After

Although the work has been done to correct existing move/insert operations in LLVM[1], we need a maintainable approach to preventing new errors from emerging (which will usually cause incorrect variable location ranges, but may result in broken modules), which likely means changing the current interface a bit. A key requirement for any change is that users should need to know as little about debug records as possible to use the interface correctly; the simplest interface to implement would simply be to enumerate the set of insertion positions, but this would be unintuitive without understanding the relevance of debug records. Instead, it would be better to examine the insertion patterns that appear in LLVM and add methods to cover each of those cases that each do the right thing.

The basic operations that place an “inserted” instruction at a position relative to a “destination” instruction or basic block are:

BasicBlock-relative operations:

After the last PHI, BB->getFirstInsertionPt() - Used when inserting instructions at the start of the destination basic block, including PHIs.
Before the terminator, BB->getTerminator()->getIterator() - Used when inserting or moving instructions to the end of the destination basic block.
At the start, BB->begin() - Not used often for inserting or moving instructions, since it would place them ahead of any existing PHIs; exceptions include constructing new blocks, and adding static allocas in a function’s entry block.
At the end, BB->end() - Not used often for the same reasons as begin(), since it would place instructions after the terminator; mostly likely to be used when the terminator is being deleted, such as when joining two basic blocks together.

Instruction-relative operations:

Before - Although the insertBefore(Instruction *) function is called often, this operation is fairly uncommon: often when we call it, we’re using it to perform one of the other operations, e.g. to insert before a terminator or the first non-PHI in a block. We use this when reordering instructions to ensure that destination instruction is dominated by the inserted instruction, or to enable a peephole optimization.
After, to use - Used so that the destination instruction dominates the inserted instruction; this doesn’t always mean a direct use of the destination instruction is required.
After, to replace - Used so that the destination instruction can be replaced by the inserted instruction, either being immediately RAUWd, replaced in part (e.g. splitting a value), or placed for a later peephole optimization to take place.

In addition to the above, there are more complex cases; for example, Instruction::getInsertionPointAfterDef(), which returns a valid insertion point iff the instruction defines a value and is not a callbr instruction; the insertion point returned is either the “After to replace” position (7), or the first insertion point of the containing basic block for a PHI node (1), or the first insertion point of the normal destination for an invoke instruction (1). This and any other specialized insertion patterns should be defined in terms of the basic operations above, and not otherwise considered part of this proposal.

Use vs Replace

From the debug records perspective, the difference between (6) and (7) is when variable values come into effect: for (6), any debug variable changes following the destination instruction will occur before the inserted instruction, meaning that if the destination represents a source variable assignment, that assignment will have taken effect when the user steps on the inserted instruction. For (7), the inserted instruction “absorbs” any debug variable changes that follow the destination, meaning that if the destination represents a source variable assignment, that assignment will take effect after the user steps past the inserted instruction. For example:

  %x = srem i32 %a, %b      ; int x = a % b;
  ; Replace position
    #dbg_record(%x, !"x", !DIExpression())
  ; Use position
  br label %BB.Next
...
  %y = sub i32 %a, %c

If we’re moving %y up so that we can replace %x with it, then we would use the replace position, so that the result is:

  %y = sub i32 %a, %c
    #dbg_record(%y, !"x", !DIExpression())
  br label %BB.Next

If we used the use position instead, then the result would be:

    #dbg_record(%y, !"x", !DIExpression())
  %y = sub i32 %a, %c
  br label %BB.Next

Resulting in a use-before-def from the #dbg_record. To put this technical error in conceptual terms: the “use” position is used for inserted instructions that merely follow their destination; the “replace” position is used for inserted instructions that will form part of the semantic intent of the destination. The replace case thus includes not just RAUWs, but cases where one instruction is split into two - in either case, if the destination represented a source assignment, we would only expect to read the result of that assignment after executing the inserted instruction.

Proposals

Our proposal is to clearly define a set of operations that explicitly target the 7 cases above, with the intent that each operation will be used for just its own case. This is not something that could be enforced at compile time, but the resulting interface should be clear enough that correct usage is intuitive and incorrect usage is relatively easy to spot. Transferring the entire codebase at once would be very difficult, but it should be possible and reasonable to update code gradually.

Firstly, we should try to remove most uses of instructions and basic blocks as pure insert-position parameters, i.e. cases where a function takes a Instruction *InsertBefore argument, or BasicBlock *InsertAtEnd, or similar; these should be replaced by iterator parameters, BasicBlock::iterator InsertAt. Exceptions to this include when:

The parameter is not just used as an insert-position; if the function in question is reading fields of the passed instruction/basic block, then it’s a good sign that the function doesn’t only want a position, so should continue to use the existing parameter.
The function only uses the parameter for an insert position but may conditionally select the position; for example, a function that takes a basic block, and based on some internal logic to the function may insert at either the start or end of the basic block.
The constraint imposed by the existing parameter is important; for example, a function that creates and inserts a PHI instruction may prefer to take a BasicBlock *InsertAtStart parameter to ensure that the PHI is never inserted anywhere other than the start of a basic block.

For the interface itself, iterators should be obtainable from the destination instruction/basic block - this will be the most common way to insert, since many functions will take iterators as insert-at parameters. In keeping with the existing Instruction::insertBefore/After(Instruction *Inst) methods, it may also be desirable to represent the basic insert operators as methods on the inserted instruction; besides reducing the size of the rewrite required, this also puts all of the insertion operators in one place where a user can easily spot them without needing to go to the documentation. This means there would be 14 ways to insert a new instruction:

Case | Destination Method                   | Inserted Instruction Method
-----+--------------------------------------+--------------------------------------------------------
   1 | BasicBlock::getFirstInsertionPt()    | Instruction::insertAtFirstInsertionPt(BasicBlock *Destination)
   2 | BasicBlock::getLastInsertionPt()     | Instruction::insertAtLastInsertionPt(BasicBlock *Destination)
   3 | BasicBlock::begin()                  | Instruction::insertAtBeginning(BasicBlock *Destination)
   4 | BasicBlock::end()                    | Instruction::insertAtEnd(BasicBlock *Destination)
   5 | Instruction::before()                | Instruction::insertBefore(Instruction *Destination)
   6 | Instruction::afterForUse()           | Instruction::insertAfterForUse(Instruction *Destination)
   7 | Instruction::afterForReplace()       | Instruction::insertAfterForReplace(Instruction *Destination)

For each insertAt/insert method, there will be an equivalent moveTo/move method as well. Although there are no strict guard rails to prevent things such as inserting at the end of the block by using I->insertBefore(BB->getTerminator()) instead of I->insertAtLastInsertionPt(BB) or I->insert(BB->getLastInsertionPt()), the list of methods is small enough for most users to learn so that mistakes should be rare and easily spotted in review.

The questions to the wider LLVM community are, does this interface look reasonable and usable to you? Are there any other insertion strategies that should be covered here? And more broadly, there any objections or suggestions that you have about this proposal?

[0] Debug info migration: From intrinsics to records — LLVM 19.0.0git documentation
[1] Remove remaining uses of Instruction-constructors that insert before another Instruction by SLTozer · Pull Request #85981 · llvm/llvm-project · GitHub
[2] llvm-project/llvm/lib/IR/Instruction.cpp at main · llvm/llvm-project · GitHub

jryans · April 26, 2024, 2:03pm

It seems like a reasonable proposal to me, thanks for working on this!

While that will (hopefully) be true especially as everyone gains more experience with the new APIs, I do think it’s critical that we have documentation explaining the purpose of each case so there’s something to refer to if you’re unsure.

It looks like you’ve done a good job describing the purpose of each case and the surrounding context in this RFC. It would be great to capture that in documentation as well for easy reference.

StephenTozer · April 29, 2024, 10:10am

While that will (hopefully) be true especially as everyone gains more experience with the new APIs, I do think it’s critical that we have documentation explaining the purpose of each case so there’s something to refer to if you’re unsure.

Yeah, the docs will of course be updated to explain the API in detail - the intent was just to make it easier for someone who hasn’t read the docs to get a quick grasp of things, since this API is one that people are probably going to deal with while in the middle of fixing some unrelated issue.

Topic		Replies	Views
[RFC] Instruction API changes needed to eliminate debug intrinsics from IR IR & Optimizations debuginfo	20	3606	March 5, 2024
Prototyping a not-an-instruction dbg.value IR & Optimizations debuginfo	3	728	November 17, 2022
PSA: Instruction-constructors changing to iterator-only insertion IR & Optimizations debuginfo , llvm	17	913	April 23, 2024
Small improvement to LLVM debugging experience IR & Optimizations	6	407	March 6, 2025
Debugging information problem: code being reordered / debug point jumping around LLVM Dev List Archives	3	86	August 1, 2011

[RFC] Rewriting the instruction insertion interface

Summary

Background

Use vs Replace

Proposals

Related topics