[RFC] Local Per-Component IR Checkpointing

Based on the original RFC’s feedback ([RFC] Lightweight LLVM IR Checkpointing), here is an alternative scheme that provides checkpointing functionality to specific components of the IR. This should cover most of the use cases while considerably reducing the risk of rolling back to a bad state, and without making it too hard to use.
Please let me know what you think.

Local Per-Component IR Checkpointing

Summary

Given that (i) global LLVM IR checkpointing is unlikely to cover 100% of the IR state, at least initially, (ii) not covering 100% of the state in global-checkpointing breaks the contract with the user and can cause bad rollbacks, (iii) we cannot guarantee that all new state gets automatically tracked whenever the core IR data structures change, and (iv) checkpointing is most probably going to be used locally for emitting/modifying a few instructions and rolling back if needed, let’s consider an alternative to global checkpointing: a checkpointing scheme that has local scope by design and applies to specific components of the IR independently.

The API of local checkpointing is very explicit about the scope of the state that gets tracked and can be rolled-back. For example if it supports two IR components, the BasicBlock and the IRBuilder, the API would look like this:

  • BasicBlock::{save/restore/accept}State(): The state includes the contained instructions, and their state, including instruction deletion. It does not include the BB itself (i.e., moving it within a Function, or deleting it), or anything else beyond the BB.
  • IRBuilder::{save/restore/accept}State(): The state tracks the creation of instructions and the builder’s insert point.

So the user can save/restore the state of specific IR components individually in a fine-grain manner.

Design

Most of the design is similar to the original global checkpointing RFC but I will repeat some of the main points to make this text easier to read.

Classes that track state changes

Similar to global checkpointing, all IR functions within the scope that modify the IR state must include code that tracks the state change. A change-tracking object stores the data needed to track the change and provide two virtual functions: revert() and apply(). We accumulate change-tracking objects in a vector in the same order as the changes occur. The change-tracking objects inherit from ChangeBase, just like in global checkpointing:

class ChangeBase {
// ...
public:
   virtual void revert() = 0;
   virtual void apply() = 0;
};

The class hierarchy is similar to that of global checkpointing:

ChangeBase -+- SetName     // Tracks changes in Value Name
            |
            +- InsertInstr // Tracks inserting an instr in a BB
            |
            +- RemoveInstr // Tracks the removal of an instr from BB
            |
            +- MoveInstr   // Tracks moving an instr before/after
            |
           ...

Accept / Rollback

Just like in global checkpointing, upon a call to restoreState() we visit the changes in reverse and undo them one by one by calling the Change::revert() function. A call to acceptState() iterates across all changes in order and calls Change::accept(), which usually does nothing, except for Changes that delete objects, which is where the actual deletion takes place. This is also identical to the global checkpointing proposal.

Storage

Storage for the actual state change objects can still be placed in LLVMContext, but this time we need dedicated storage for each object being tracked. So we will need a map data structure that maps objects (e.g, BasicBlock *) to their corresponding state tracking data structure, that includes the state change vector. So a call to BasicBlock::saveState() calls LLVMContext::saveBBState() which looks up BB’s state in the map and activates tracking.

void BasicBlock::saveState() {
   getContext().saveBBState(this);
}
void LLVMContext::saveBBState(BasicBlock *BB) {
   ChkpntBBMap[BB]->save(); // Activates tracking for `BB`
}

Tracking changes

The tracking code is similar to that of global-checkpointing but it has an extra step that checks whether that particular change should be checked based on its context. The first step makes sure that we can bail out quickly if there is no active tracking, and the second step checks if the current object is actually being tracked. For example, moving an instruction should be tracked only if its source or destination parent block is being tracked:

void Instruction::moveBefore(BasicBlock &BB, <iterator> It) {
   // Cheap 1st level check skips the expensive part.
   if (LLVM_UNLIKELY(BB.getContext().isCheckpointActive())) {
      // More expensive 2nd level checks that include map lookups
      if (getParent()->isCheckpointActive() ||
          BB.isCheckpointActive()) {
         // Push change into ChkpntBBMap[The correct BB]
      }
   }
   // Code that moves the instruction
}

Compile-time Overhead

The performance impact when checkpointing is inactive should be at the same levels of the global-checkpointing approach, as the change tracking code is structured in a similar way. So I would expect a similar overhead of < 0.1%.

ValueHandles

Changes to the instruction operands (e.g. RAUW) will notify ValueHandles right away, but callbacks about deletion will be deferred until the state is accepted with a call to acceptState(). More on this later.

Code Examples

Here are a couple of examples of how one would use Local Checkpointing to save/restore state.

Example 1: Change operand and RAUW

BB->saveState(); // Starts tracking the contents of BB
Instruction *I = &*BB->begin();
I->setOperand(Foo);
I->replaceAllUsesWith(Bar);
if (shouldKeepIR())
   BB->acceptState();
else
   BB->restoreState(); // Restores state of all instructions in BB and users touched by RAUW

Example 2: Create new instructions

IRBuilder->saveState(); // Starts tracking of instructions created by the builder and  the insert point
BB->saveState();
IRBuilder->setInsertPoint(BB, BB->begin());
IRBuilder->createFoo(...);
Instruction *Bar = IRBuilder->createBar(...);
Bar->moveBefore(Baz);
if (shouldKeepIR()) {
   BB->acceptState();
   IRBuilder->acceptState();
} else {
   BB->restoreState(); // Removes Foo and Bar and restores instruction order
   IRBuilder->restoreState(); // Deletes newly created instructions and restores insert point
}

Example 3: Instruction deletion

BB->saveState();
Instruction *I = &*BB->begin();
I->eraseFromParent(); // This won't actually delete `I`. It is equivalent to I->dropAllReferences() and I->removeFromParent()
if (shouldKeepIR())
   BB->acceptState(); // This actually deletes `I`.
else
   BB->restoreState(); // Restores `BB` to its original state.

Local Checkpointing Addresses the Main Concerns of Global Checkpointing

1. Coverage

In global checkpointing, not covering 100% of the IR is a bug because the contract with the user is that all state gets saved and restored. Even if the user of global checkpointing is aware of its limitations and makes sure that the code works well with it, it can still lead to failures in cases like this: a checkpointed region calls an external function and some other developer changes the code of that function in a way that it modifies state not being tracked by checkpointing. This can lead to a bad rolled-back state.
For example in global-checkpointing:

// The programmer wants to maintain the state of BB.
Chkpnt.save();
someUtilityFn(BB);  // Buggy if `someUtilityFn()` modifies state that is not in checkpoint’s coverage 
Chkpnt.rollback();

With local checkpointing this is no longer an issue as only specific components of the IR feature save/restore functionality and these are guaranteed to cover 100% of their state. The code using checkpointing is designed with specific knowledge of exactly what state can be saved and rolled back. A callee can be updated to modify any state it needs, without causing failures.
For example in local checkpointing:

// The programmer wants to maintain the state of BB
BB->saveState();
someUtilityFn(BB); // It can touch any state. The programmer only cares about the state of BB
BB->restoreState();

Even if someUtilityFn() is modified by some developer to touch state outside the BB, this will not break rollback. After rollback BB will be in its original state and this is exactly what the programmer intended.

2. Updating core IR classes and adding new state

In global checkpointing, adding new state to the IR and forgetting to support it in checkpointing can lead to a bad roll-back state.
This issue is still there in local checkpointing to some extent, but there are two main differences:

  1. The scope is now reduced so we can afford to enable verifier checks in normal Debug builds. These checks can catch such bugs.
  2. The scope is local, so testing is a lot easier and any such issue should be easier to track and manage.

3. State of analyses and data-structures that rely on ValueHandles

Unlike in global checkpointing, the state of these external data-structures will obviously be out-of-scope by design, so the user won’t expect them to be automatically updated.

One issue that remains has to do with deletion of Instructions: the callbacks for deletion of Instructions in a BasicBlock are deferred until the changes are accepted with a call to BasicBlock::acceptState(). I think this behavior makes sense, and is semantically correct because the instruction is not actually deleted until acceptState() is called. If instead we notify the listeners right away that an Instruction got deleted and we later on roll-back, then that would be incorrect, as the instruction would still be around.

4. Deployment

A local scheme can be developed and deployed gradually, one component at a time, because it is very specific about what state can be saved/restored. New functionality can be developed and added as needed. Adding checkpointing support to more IR objects won’t break the existing code.

New Challenges

  • Not as easy to reason about: The user has to be fully aware and be more careful about the exact state that gets saved and restored.
    For example: a call to BB->restoreState() will only rollback the contents of the BB, but will leak memory unless IRBuilder->restoreState() is also called. Also if IRBuilder->restoreState() is called before BB->restoreState() then this will likely cause a crash as the BB’s Change objects will have gone stale.

Given the principle that an API should be easy to use correctly and hard to use incorrectly, have you thought about encapsulating the save/restore/accept part in an RAII class, that will assert if a pass exits after saving but without restoring/accepting? Or, possibly it should default to accepting, assuming that restore is the unusual/exceptional case.

(I normally stay away from optimization passes so I really have no stake in this; it just seemed to like a straightforward way to run the API.)

Yes the original proposal was using a RAII class this exact reason. We can do the same here. Perhaps something like:

auto Chkpnt = BB->getCheckpoint();
Chkpnt.save();
// ...
Chkpnt.restore();

If Chkpnt goes out of scope we can either check if we restored/accepted. I would prefer that we did not just accept, to avoid cases where the programmer just forgets to stop tracking, because there is some overhead associated with it.

I’m sorry but I don’t see how this is an improvement. Doesn’t this have the same problem that if a user of checkpointing modifies part of the IR that isn’t covered by the checkpointing, they will likely see unexpected results?

They look similar, but I think there is an important difference: In the previous version the user expected checkpointing to save and restore the whole IR state. So the problem was that if checkpointing didn’t actually support 100% of the IR, which was quite likely due to either an implementation bug or some change in the IR itself, then a rollback would result in an unexpected state or a crash. And the infrastructure would be to blame for this.
In this version, however, we are very explicit about which state is being saved and restored, and that state is guaranteed to be covered. That is, when you are rolling back the state of a BB there is no expectation from the user to see any other change being rolled back. This reduced scope makes it a lot easier to test and guarantee that the coverage claimed is what the user will actually get. So unlike the previous version, ending up in an unexpected state can no longer be caused by the infrastructure itself, but instead it would always be a logic error from the user’s side.

It would be good to spell out expectations for the long-term evolution, then.

For example, do we expect the checkpointing to eventually cover the insertion or removal of basic blocks, and then how does that affect the API and potential interactions?

I think it is basically guaranteed that if we open the door for checkpointing even a little, people will soon want to open it far wider. And that then raises the question of how a BasicBlock::{save,restore}State interacts with a potential future Function::{save,restore}State etc.

We should at least consider something that’s more of a hybrid API, e.g.

Checkpointer c(getContext());
c.trackBasicBlock(bb);
c.trackInsertionPoint(builder);

...

if (...)
  c.accept();
else
  c.restore();

(Is “checkpoint” even the right name for this? What about “transaction”?)

Eventually we may support this too. I think that would probably be part of the Function state.

Since we are keeping track of the state by collecting changes in the IR, and not by taking a snapshot of the IR, there are a couple of design restrictions:

  1. there must be a single sequence of changes and
  2. the sequence must be kept in order such that reverting the state can be performed in the reverse order.

Nested components like BasicBlocks and Functions may both be tracking changes. Consider some user code like this:

F->save();
BB->save(); // `BB` is in `F`
I->setOperand(...); // `I` is in`BB`

Before recording the change of I->setOperand() we need to check if any of its parent components (that is its parent BB or F) are being tracked. If they are, then we create a new Change object and append it into the change tracking vector. So tracking changes is pretty straight forward.

The more interesting case is when restoring state. What if the user calls a Function::restore() before a BasicBlock::restore() like this:

F->save();
BB->save(); // `BB` is in `F`
I->setOperand(...); // `I` is in`BB`
F->restore();  // Should revert `I`'s state
BB->restore(); // Should do nothing

In this case I think the expected behavior would be for F->restore() to revert the change made to I and for BB->restore() to do nothing.
But the reverse should also behave similarly:

BB->restore(); // Should revert `I`'s state
F->restore();  // Should do nothing 

Overall I think that a Change should be “owned” by all objects that are higher in the hierarchy. That is, any change to an Instruction is owned by its parent BB or its parent Function, or even its parent Module. Restoring the state of any one of its owner functions would revert that Change.

For this to work:

  1. each Change object would need to contain a pointer to the immediate parent object (i.e., the change for I->setOperand() should be tagged with BB as its “owner”), and
  2. restore() should check if the the object being restored either matches the “owner” of the Change or is a parent of the “owner”. For example while executing F->restore() we visit the I->setOperand() Change which is has BB as its “owner”. Since F is BB’s parent then we proceed to revert the change.
  3. Reverted changes are removed from the vector. This guarantees that we revert changes only once.

Following this design things like this are not too confusing:

F->save();
I->setOperand(...); // `I` is in`BB` and `BB' is in `F`
BB->save(); // `BB` is "owned" by `F` which is already tracked.
I2->setOperand(...); // `I2` is in `BB`
F->restore();  // Should revert the state of `I` and `I2`
I3->setOperand(...); // `I3` is in `BB`
BB->restore(); // Reverts `I3`'s state

Yes, I think an API like what you propose is better and less confusing because it implies that there is a single checkpointing entity that tracks changes. So tracking overlapping changes, like tracking BBs and their parent Functions, is less confusing.
Transaction is a good name too, I don’t have a strong preference.

I uploaded the first patch for review ⚙ D147251 [Chkpnt] Initial patch for checkpointing. .
Feel free to take a look.