Based on the original RFC’s feedback ([RFC] Lightweight LLVM IR Checkpointing), here is an alternative scheme that provides checkpointing functionality to specific components of the IR. This should cover most of the use cases while considerably reducing the risk of rolling back to a bad state, and without making it too hard to use.
Please let me know what you think.
Local Per-Component IR Checkpointing
Summary
Given that (i) global LLVM IR checkpointing is unlikely to cover 100% of the IR state, at least initially, (ii) not covering 100% of the state in global-checkpointing breaks the contract with the user and can cause bad rollbacks, (iii) we cannot guarantee that all new state gets automatically tracked whenever the core IR data structures change, and (iv) checkpointing is most probably going to be used locally for emitting/modifying a few instructions and rolling back if needed, let’s consider an alternative to global checkpointing: a checkpointing scheme that has local scope by design and applies to specific components of the IR independently.
The API of local checkpointing is very explicit about the scope of the state that gets tracked and can be rolled-back. For example if it supports two IR components, the BasicBlock
and the IRBuilder
, the API would look like this:
-
BasicBlock::{save/restore/accept}State()
: The state includes the contained instructions, and their state, including instruction deletion. It does not include the BB itself (i.e., moving it within a Function, or deleting it), or anything else beyond the BB. -
IRBuilder::{save/restore/accept}State()
: The state tracks the creation of instructions and the builder’s insert point.
So the user can save/restore the state of specific IR components individually in a fine-grain manner.
Design
Most of the design is similar to the original global checkpointing RFC but I will repeat some of the main points to make this text easier to read.
Classes that track state changes
Similar to global checkpointing, all IR functions within the scope that modify the IR state must include code that tracks the state change. A change-tracking object stores the data needed to track the change and provide two virtual functions: revert()
and apply()
. We accumulate change-tracking objects in a vector in the same order as the changes occur. The change-tracking objects inherit from ChangeBase
, just like in global checkpointing:
class ChangeBase {
// ...
public:
virtual void revert() = 0;
virtual void apply() = 0;
};
The class hierarchy is similar to that of global checkpointing:
ChangeBase -+- SetName // Tracks changes in Value Name
|
+- InsertInstr // Tracks inserting an instr in a BB
|
+- RemoveInstr // Tracks the removal of an instr from BB
|
+- MoveInstr // Tracks moving an instr before/after
|
...
Accept / Rollback
Just like in global checkpointing, upon a call to restoreState()
we visit the changes in reverse and undo them one by one by calling the Change::revert()
function. A call to acceptState()
iterates across all changes in order and calls Change::accept()
, which usually does nothing, except for Changes that delete objects, which is where the actual deletion takes place. This is also identical to the global checkpointing proposal.
Storage
Storage for the actual state change objects can still be placed in LLVMContext
, but this time we need dedicated storage for each object being tracked. So we will need a map data structure that maps objects (e.g, BasicBlock *
) to their corresponding state tracking data structure, that includes the state change vector. So a call to BasicBlock::saveState()
calls LLVMContext::saveBBState()
which looks up BB’s state in the map and activates tracking.
void BasicBlock::saveState() {
getContext().saveBBState(this);
}
void LLVMContext::saveBBState(BasicBlock *BB) {
ChkpntBBMap[BB]->save(); // Activates tracking for `BB`
}
Tracking changes
The tracking code is similar to that of global-checkpointing but it has an extra step that checks whether that particular change should be checked based on its context. The first step makes sure that we can bail out quickly if there is no active tracking, and the second step checks if the current object is actually being tracked. For example, moving an instruction should be tracked only if its source or destination parent block is being tracked:
void Instruction::moveBefore(BasicBlock &BB, <iterator> It) {
// Cheap 1st level check skips the expensive part.
if (LLVM_UNLIKELY(BB.getContext().isCheckpointActive())) {
// More expensive 2nd level checks that include map lookups
if (getParent()->isCheckpointActive() ||
BB.isCheckpointActive()) {
// Push change into ChkpntBBMap[The correct BB]
}
}
// Code that moves the instruction
}
Compile-time Overhead
The performance impact when checkpointing is inactive should be at the same levels of the global-checkpointing approach, as the change tracking code is structured in a similar way. So I would expect a similar overhead of < 0.1%.
ValueHandles
Changes to the instruction operands (e.g. RAUW) will notify ValueHandles right away, but callbacks about deletion will be deferred until the state is accepted with a call to acceptState()
. More on this later.
Code Examples
Here are a couple of examples of how one would use Local Checkpointing to save/restore state.
Example 1: Change operand and RAUW
BB->saveState(); // Starts tracking the contents of BB
Instruction *I = &*BB->begin();
I->setOperand(Foo);
I->replaceAllUsesWith(Bar);
if (shouldKeepIR())
BB->acceptState();
else
BB->restoreState(); // Restores state of all instructions in BB and users touched by RAUW
Example 2: Create new instructions
IRBuilder->saveState(); // Starts tracking of instructions created by the builder and the insert point
BB->saveState();
IRBuilder->setInsertPoint(BB, BB->begin());
IRBuilder->createFoo(...);
Instruction *Bar = IRBuilder->createBar(...);
Bar->moveBefore(Baz);
if (shouldKeepIR()) {
BB->acceptState();
IRBuilder->acceptState();
} else {
BB->restoreState(); // Removes Foo and Bar and restores instruction order
IRBuilder->restoreState(); // Deletes newly created instructions and restores insert point
}
Example 3: Instruction deletion
BB->saveState();
Instruction *I = &*BB->begin();
I->eraseFromParent(); // This won't actually delete `I`. It is equivalent to I->dropAllReferences() and I->removeFromParent()
if (shouldKeepIR())
BB->acceptState(); // This actually deletes `I`.
else
BB->restoreState(); // Restores `BB` to its original state.
Local Checkpointing Addresses the Main Concerns of Global Checkpointing
1. Coverage
In global checkpointing, not covering 100% of the IR is a bug because the contract with the user is that all state gets saved and restored. Even if the user of global checkpointing is aware of its limitations and makes sure that the code works well with it, it can still lead to failures in cases like this: a checkpointed region calls an external function and some other developer changes the code of that function in a way that it modifies state not being tracked by checkpointing. This can lead to a bad rolled-back state.
For example in global-checkpointing:
// The programmer wants to maintain the state of BB.
Chkpnt.save();
someUtilityFn(BB); // Buggy if `someUtilityFn()` modifies state that is not in checkpoint’s coverage
Chkpnt.rollback();
With local checkpointing this is no longer an issue as only specific components of the IR feature save/restore functionality and these are guaranteed to cover 100% of their state. The code using checkpointing is designed with specific knowledge of exactly what state can be saved and rolled back. A callee can be updated to modify any state it needs, without causing failures.
For example in local checkpointing:
// The programmer wants to maintain the state of BB
BB->saveState();
someUtilityFn(BB); // It can touch any state. The programmer only cares about the state of BB
BB->restoreState();
Even if someUtilityFn()
is modified by some developer to touch state outside the BB, this will not break rollback. After rollback BB
will be in its original state and this is exactly what the programmer intended.
2. Updating core IR classes and adding new state
In global checkpointing, adding new state to the IR and forgetting to support it in checkpointing can lead to a bad roll-back state.
This issue is still there in local checkpointing to some extent, but there are two main differences:
- The scope is now reduced so we can afford to enable verifier checks in normal Debug builds. These checks can catch such bugs.
- The scope is local, so testing is a lot easier and any such issue should be easier to track and manage.
3. State of analyses and data-structures that rely on ValueHandles
Unlike in global checkpointing, the state of these external data-structures will obviously be out-of-scope by design, so the user won’t expect them to be automatically updated.
One issue that remains has to do with deletion of Instructions: the callbacks for deletion of Instructions in a BasicBlock are deferred until the changes are accepted with a call to BasicBlock::acceptState()
. I think this behavior makes sense, and is semantically correct because the instruction is not actually deleted until acceptState()
is called. If instead we notify the listeners right away that an Instruction got deleted and we later on roll-back, then that would be incorrect, as the instruction would still be around.
4. Deployment
A local scheme can be developed and deployed gradually, one component at a time, because it is very specific about what state can be saved/restored. New functionality can be developed and added as needed. Adding checkpointing support to more IR objects won’t break the existing code.
New Challenges
- Not as easy to reason about: The user has to be fully aware and be more careful about the exact state that gets saved and restored.
For example: a call toBB->restoreState()
will only rollback the contents of the BB, but will leak memory unlessIRBuilder->restoreState()
is also called. Also ifIRBuilder->restoreState()
is called beforeBB->restoreState()
then this will likely cause a crash as theBB
’s Change objects will have gone stale.