Hi everyone,
I have been working on a checkpointing prototype for LLVM IR and I would like to get your thoughts on it.
Summary
A lightweight checkpointing scheme for LLVM IR with a simple save()
/ rollback()
/ accept()
interface.
Checkpointing allows the user to make arbitrary changes to the LLVM IR and then rollback to the saved state with a call to rollback()
. The benefit of having this as part of the LLVM IR API is that it removes the burden of building custom rollback schemes, which is non-trivial and error-prone.
Transformation passes usually avoid modifying the IR unless they estimate that the new IR is better than the original one. However, these estimations are not always straightforward, or may not be as accurate as if we had actual IR to work with. For such cases a checkpointing scheme may be a better option. By supporting checkpointing natively in the LLVM IR API we are facilitating the use of checkpointing as a design option for transformation passes.
Design
Main design points
- Lightweight: Checkpointing works by tracking IR changes, not by saving the entire IR state. Therefore
save()
is instantaneous and bothrollback()
andaccept()
are linear to the number of changes performed on the IR sincesave()
was called. - Ease of use: The only interaction with the user is through the
save()
,rollback()
,accept()
interface functions. The rest takes place in the background completely hidden from the user. - Compilation time:
- Negligible impact on compilation time when checkpointing is not active.
- Compilation time overhead that is linear to the number of changes, when checkpointing is active and changes are being tracked.
- Pointer consistency: Values that get
erasedFromParent()
will still be accessible using the same pointer after rollback, otherwise data structures will contain stale data.
Checkpoint API
Interacting with the Checkpoint class happens through the use of a CheckpointHandle
object that you can get using:
auto Chkpnt = getContext()->getCheckpointHandle();
The user can now call Chkpnt.save()
to start tracking changes.
- If the user decides to keep the changes, then they can run
Chkpnt.accept()
. - If they decide to revert them, they can use
Chkpnt.rollback()
- If the
Chkpnt
handle goes out of scope, we check whether the user calledaccept()
orrollback()
and crash if none was called.
Example
Given the IR:
define void @foo(i32 %a, i32 %b) {
bb0:
%toDelete = add i32 %a, %b
ret void
}
The user can save state using:
CheckpointHandle Chkpnt = M->getContext().getCheckpointHandle();
Chkpnt.save();
Then they can delete the %toDelete
instruction from the code:
ToDelete->eraseFromParent();
Which would result in this IR:
define void @foo(i32 %a, i32 %b) {
bb0:
ret void
}
Running:
Chkpnt.rollback();
will rollback the state of the IR so it will re-insert %toDelete into
bb0:
define void @foo(i32 %a, i32 %b) {
bb0:
%toDelete = add i32 %a, %b
ret void
}
Tracking Changes
Changes are tracked using sub-classes of the ChangeBase
class, which hold any state required to revert that particular IR change.
The base class ChangeBase is an abstract class that sets the basic interface with the revert() and apply() functions.
class ChangeBase {
// ...
Public:
// ...
virtual void revert() = 0;
virtual void apply() = 0;
// ...
};
Here are some of the change-tracking classes:
ChangeBase -+- SetName // Tracks changes in Value Name
|
+- InsertInstr // Tracks inserting an instr in a BB
|
+- RemoveInstr // Tracks the removal of an instr from BB
|
+- MoveInstr // Tracks moving an instr before/after
|
+- SetUse // Tracks a change in Use definition
|
+- RemoveBB // Tracks removal of a BB from a Function
|
+- MoveBB // Tracks the movement of BB before/after
|
+- DeleteValue // Tracks the deletion of a Value object
|
Instances of these classes are pushed into a vector held by the CheckpointEngine
class whenever a change in the IR takes place. Please note that all members of the CheckpointEngine
class are private. Instruction
, Value
and BasicBlock
are friends so that they can push changes. The instance of CheckpointEngine
is helcode section specify language c++d in LLVMContextImpl
because it needs to be accessible from any IR object that may be modified.
class CheckpointEngine {
friend class Instruction; // For receiving changes
friend class Value; // For receiving changes
friend class BasicBlock; // For receiving changes
friend class Checkpoint; // The handle that calls save/rollback/accept
bool Active = false; // True while change tracking is active
SmallVector<std::unique_ptr<ChangeBase>> Changes;
/// ...
void save();
void rollback();
void accept();
}
Changes get pushed to Checkpoint::Changes whenever an IR change. For example insertBefore()
includes a check whether checkpointing is active and if so pushes calls Chkpnt.insertInstr(this)
which creates an InsertInstr
object and pushed it into the Changes
vector.
void Instruction::insertBefore(Instruction *InsertPos) {
/// ...
CheckpointEngine &ChkpntEngine = getContext().getChkpntEngine();
if (LLVM_UNLIKELY(ChkpntEngine.isActive()))
ChkpntEngine.insertInstr(this);
}
CheckpointHandle API
The user does not interact with CheckpointEngine
directly, but rather with the handle Checkpoint
which has access to its private member functions. The main benefit of using this handle is for being explicit to the user that they need to handle the scope of checkpointing, and check if the user forgot to call accept()
or rollback()
when the handle goes out of scope.
class Checkpoint {
void save();
void rollback();
void accept();
~Checkpoint() { assert(ChkpntEngine.empty() &&
“Missing call to accept() or rollback()”); }
}
Dealing with deletion and eraseFromParent()
Deleting IR objects needs special care, because reverting a deletion needs to bring back the original pointer, otherwise data structures that contain the deleted objects will go stale after a rollback.
We make this work by not actually deleting objects while checkpointing is active. Instead we detach it from its parent (like what removeFromParent()
does) and drop any edges with other objects (like what dropAllReferences()
does). These changes are all tracked.
If we rollback, then we simply insert the object back into the list and reconnect it to its defs/uses. If we don’t rollback we just delete the object when accept()
is called.
Object creation
IR object creation is tracked in the Value
class constructor. A rollback()
will delete the created object.
Rollback()
Rollback iterates through CheckpointEngine::Changes
in reverse, calling revert()
for each of the change objects. This will undo all modifications performed to the IR one by one in reverse order. This is a linear-time operation to the number of changes recorded.
Limitations
- Order of uses: Currently we are not maintaining the order of users. So after a
rollback()
one may observe a reordered users list. This is a subtle change, but it can still be observed in the printed IR, for example in the list of predecessors of a BasicBlock (it usually looks like this:bb1: ; preds = %bb1, %bb2, %bb3
) - Constants and Types: Creating new Types or Constants is not currently being reverted, they will remain even after we rollback.
- Metadata: Changes to metadata are currently only partially supported.
- User data structures: Checkpointing operates only on the IR. It is the user’s responsibility to update any data structures that contain rolled-back IR data. This includes ValueHandles: during a rollback the values tracked by ValueHandles may change.
- Analyses: For now we don’t support checkpointing of analysis passes (and whether we should add support is debatable). For example, block frequencies, edge probabilities or dominator tree, all maintain their own data structures which are not currently being tracked. So any changes in the data owned by these analyses won’t be reverted automatically on a
rollback()
.
Impact on compilation time
- When checkpointing is not active, the overhead of checkpointing is negligible because the code goes through the fast path. The actual checkpointing code is guarded by:
if (LLVM_UNLIKELY(getContext().isActive())) {
// Checkpointing code
}
which does not seem to introduce a lot of overhead.
Some quick measurements I made when compiling the LLVM source tree with a checkpoint-enabled compiler (a normal Release non-LTO build) showed a < 0.5% mean wall time difference over 10 runs on a Skylake server machine, though the runs were quite noisy.
- When checkpointing is active, changes to the IR have to be tracked, so we have to go through the slow path that creates the corresponding
Change
object and pushes it into theCheckpointEngine::Changes
vector. -
Checkpoint::save()
is basically instantaneous. It just sets theActive
boolean to true. -
Checkpoint::rollback()
is linear to the number of changes recorded sincesave()
was called. -
CheckpointApply::accept()
is linear to the number of changes recorded, but most of the changes perform no action except for those that free memory, so it is not too expensive to run.
FAQ
How many hooks do we need?
Currently there are around 50 types of hooks that get tracked by around 500 lines of hook-related code across several files. With these we can successfully track most of the major functionality needed by some common optimization passes.
I am pretty sure there is a long tail of state remaining to be tracked, but I think we can safely reduce the scope of checkpointing to a subset that is useful for most common use-cases.
What happens when checkpointing fails to track some state?
Since checkpointing is in early development I have added an expensive check that dumps the IR at save()
and checks the IR at rollback()
to make sure they are textually equivalent. This catches all bugs in state that is exposed by the LLVM IR’s textual form.
Bugs in state that is not exposed by the textual form of the IR can result in silent bugs, but in my experience assertions in the LLVM code will catch many of them.
How about Testing?
Apart from a long list of unit tests, we can test checkpointing by modifying the existing lit tests to get them to run a checkpoint-save
and a chekcpoint-rollback
pass around the passes that the lit test is already running in the -passes
list (like so : -passes=checkpoint-save,<pass(es)>,checkpoint-rollback
), and modifying CHECK
directives to test that the code remains unmodified after a rollback. This is the main testing method I am currently using to expose bugs and missing features.
Can we rollback whole passes?
Yes (this is actually how we are testing checkpoint using lit tests), but this would most probably result in a very long list of changes that need to be rolled-back. This checkpointing scheme is designed for local changes.