[RFC] Lightweight LLVM IR Checkpointing

Hi everyone,

I have been working on a checkpointing prototype for LLVM IR and I would like to get your thoughts on it.

Summary

A lightweight checkpointing scheme for LLVM IR with a simple save() / rollback() / accept() interface.

Checkpointing allows the user to make arbitrary changes to the LLVM IR and then rollback to the saved state with a call to rollback(). The benefit of having this as part of the LLVM IR API is that it removes the burden of building custom rollback schemes, which is non-trivial and error-prone.

Transformation passes usually avoid modifying the IR unless they estimate that the new IR is better than the original one. However, these estimations are not always straightforward, or may not be as accurate as if we had actual IR to work with. For such cases a checkpointing scheme may be a better option. By supporting checkpointing natively in the LLVM IR API we are facilitating the use of checkpointing as a design option for transformation passes.

Design

Main design points

  • Lightweight: Checkpointing works by tracking IR changes, not by saving the entire IR state. Therefore save() is instantaneous and both rollback() and accept() are linear to the number of changes performed on the IR since save() was called.
  • Ease of use: The only interaction with the user is through the save() , rollback(), accept() interface functions. The rest takes place in the background completely hidden from the user.
  • Compilation time:
    • Negligible impact on compilation time when checkpointing is not active.
    • Compilation time overhead that is linear to the number of changes, when checkpointing is active and changes are being tracked.
  • Pointer consistency: Values that get erasedFromParent() will still be accessible using the same pointer after rollback, otherwise data structures will contain stale data.

Checkpoint API

Interacting with the Checkpoint class happens through the use of a CheckpointHandle object that you can get using:

auto Chkpnt = getContext()->getCheckpointHandle();

The user can now call Chkpnt.save() to start tracking changes.

  • If the user decides to keep the changes, then they can run Chkpnt.accept().
  • If they decide to revert them, they can use Chkpnt.rollback()
  • If the Chkpnt handle goes out of scope, we check whether the user called accept() or rollback() and crash if none was called.

Example

Given the IR:

  define void @foo(i32 %a, i32 %b) {                                                                                                                                             
  bb0:                                                                                                                                                                           
    %toDelete = add i32 %a, %b                                                                                                                                                   
    ret void                                                                                                                                                                     
  }  

The user can save state using:

CheckpointHandle Chkpnt = M->getContext().getCheckpointHandle();
Chkpnt.save();

Then they can delete the %toDelete instruction from the code:

ToDelete->eraseFromParent();

Which would result in this IR:

  define void @foo(i32 %a, i32 %b) {                                                                                                                                             
  bb0:                                                                                                                                                                           
    ret void                                                                                                                                                                     
  } 

Running:

Chkpnt.rollback();

will rollback the state of the IR so it will re-insert %toDelete into

bb0:
  define void @foo(i32 %a, i32 %b) {                                                                                                                                             
  bb0:                                                                                                                                                                           
    %toDelete = add i32 %a, %b                                                                                                                                                   
    ret void                                                                                                                                                                     
  }  

Tracking Changes

Changes are tracked using sub-classes of the ChangeBase class, which hold any state required to revert that particular IR change.
The base class ChangeBase is an abstract class that sets the basic interface with the revert() and apply() functions.

class ChangeBase {
// ...
Public:
   // ...
   virtual void revert() = 0;
   virtual void apply() = 0;
   // ...
};

Here are some of the change-tracking classes:

ChangeBase -+- SetName     // Tracks changes in Value Name
            |
            +- InsertInstr // Tracks inserting an instr in a BB
            |
            +- RemoveInstr // Tracks the removal of an instr from BB
            |
            +- MoveInstr   // Tracks moving an instr before/after
            |
            +- SetUse      // Tracks a change in Use definition
            |
            +- RemoveBB    // Tracks removal of a BB from a Function
            |
            +- MoveBB      // Tracks the movement of BB before/after
            |
            +- DeleteValue // Tracks the deletion of a Value object
            |

Instances of these classes are pushed into a vector held by the CheckpointEngine class whenever a change in the IR takes place. Please note that all members of the CheckpointEngine class are private. Instruction, Value and BasicBlock are friends so that they can push changes. The instance of CheckpointEngine is helcode section specify language c++d in LLVMContextImpl because it needs to be accessible from any IR object that may be modified.

class CheckpointEngine {
   friend class Instruction; // For receiving changes
   friend class Value;       // For receiving changes
   friend class BasicBlock;  // For receiving changes

   friend class Checkpoint; // The handle that calls save/rollback/accept

   bool Active = false;  // True while change tracking is active
   SmallVector<std::unique_ptr<ChangeBase>> Changes;
   /// ...
   void save();
   void rollback();
   void accept();
}

Changes get pushed to Checkpoint::Changes whenever an IR change. For example insertBefore() includes a check whether checkpointing is active and if so pushes calls Chkpnt.insertInstr(this) which creates an InsertInstr object and pushed it into the Changes vector.

void Instruction::insertBefore(Instruction *InsertPos) {
  /// ...
  CheckpointEngine &ChkpntEngine = getContext().getChkpntEngine();
  if (LLVM_UNLIKELY(ChkpntEngine.isActive()))
    ChkpntEngine.insertInstr(this);
 }

CheckpointHandle API

The user does not interact with CheckpointEngine directly, but rather with the handle Checkpoint which has access to its private member functions. The main benefit of using this handle is for being explicit to the user that they need to handle the scope of checkpointing, and check if the user forgot to call accept() or rollback() when the handle goes out of scope.

class Checkpoint {
   void save();
   void rollback();
   void accept();
   ~Checkpoint() { assert(ChkpntEngine.empty() &&
                          “Missing call to accept() or rollback()”); }
}

Dealing with deletion and eraseFromParent()

Deleting IR objects needs special care, because reverting a deletion needs to bring back the original pointer, otherwise data structures that contain the deleted objects will go stale after a rollback.

We make this work by not actually deleting objects while checkpointing is active. Instead we detach it from its parent (like what removeFromParent() does) and drop any edges with other objects (like what dropAllReferences() does). These changes are all tracked.
If we rollback, then we simply insert the object back into the list and reconnect it to its defs/uses. If we don’t rollback we just delete the object when accept() is called.

Object creation

IR object creation is tracked in the Value class constructor. A rollback() will delete the created object.

Rollback()

Rollback iterates through CheckpointEngine::Changes in reverse, calling revert() for each of the change objects. This will undo all modifications performed to the IR one by one in reverse order. This is a linear-time operation to the number of changes recorded.

Limitations

  • Order of uses: Currently we are not maintaining the order of users. So after a rollback() one may observe a reordered users list. This is a subtle change, but it can still be observed in the printed IR, for example in the list of predecessors of a BasicBlock (it usually looks like this: bb1: ; preds = %bb1, %bb2, %bb3 )
  • Constants and Types: Creating new Types or Constants is not currently being reverted, they will remain even after we rollback.
  • Metadata: Changes to metadata are currently only partially supported.
  • User data structures: Checkpointing operates only on the IR. It is the user’s responsibility to update any data structures that contain rolled-back IR data. This includes ValueHandles: during a rollback the values tracked by ValueHandles may change.
  • Analyses: For now we don’t support checkpointing of analysis passes (and whether we should add support is debatable). For example, block frequencies, edge probabilities or dominator tree, all maintain their own data structures which are not currently being tracked. So any changes in the data owned by these analyses won’t be reverted automatically on a rollback().

Impact on compilation time

  • When checkpointing is not active, the overhead of checkpointing is negligible because the code goes through the fast path. The actual checkpointing code is guarded by:
if (LLVM_UNLIKELY(getContext().isActive())) {
  // Checkpointing code
}

which does not seem to introduce a lot of overhead.
Some quick measurements I made when compiling the LLVM source tree with a checkpoint-enabled compiler (a normal Release non-LTO build) showed a < 0.5% mean wall time difference over 10 runs on a Skylake server machine, though the runs were quite noisy.

  • When checkpointing is active, changes to the IR have to be tracked, so we have to go through the slow path that creates the corresponding Change object and pushes it into the CheckpointEngine::Changes vector.
  • Checkpoint::save() is basically instantaneous. It just sets the Active boolean to true.
  • Checkpoint::rollback() is linear to the number of changes recorded since save() was called.
  • CheckpointApply::accept() is linear to the number of changes recorded, but most of the changes perform no action except for those that free memory, so it is not too expensive to run.

FAQ

How many hooks do we need?

Currently there are around 50 types of hooks that get tracked by around 500 lines of hook-related code across several files. With these we can successfully track most of the major functionality needed by some common optimization passes.
I am pretty sure there is a long tail of state remaining to be tracked, but I think we can safely reduce the scope of checkpointing to a subset that is useful for most common use-cases.

What happens when checkpointing fails to track some state?

Since checkpointing is in early development I have added an expensive check that dumps the IR at save() and checks the IR at rollback() to make sure they are textually equivalent. This catches all bugs in state that is exposed by the LLVM IR’s textual form.

Bugs in state that is not exposed by the textual form of the IR can result in silent bugs, but in my experience assertions in the LLVM code will catch many of them.

How about Testing?

Apart from a long list of unit tests, we can test checkpointing by modifying the existing lit tests to get them to run a checkpoint-save and a chekcpoint-rollback pass around the passes that the lit test is already running in the -passes list (like so : -passes=checkpoint-save,<pass(es)>,checkpoint-rollback), and modifying CHECK directives to test that the code remains unmodified after a rollback. This is the main testing method I am currently using to expose bugs and missing features.

Can we rollback whole passes?

Yes (this is actually how we are testing checkpoint using lit tests), but this would most probably result in a very long list of changes that need to be rolled-back. This checkpointing scheme is designed for local changes.

1 Like

I’ve always thought this would be a nice feature to have precisely for the use cases that you mention. My concern is that this will enable design patterns that, over time, will rely too heavily on checkpointing as a means of optimization. I think a backtracking approach to transformations that use this is worse overall than having a suboptimal transformation.

1 Like

Good point, I am not sure how to discourage unnecessary use. Perhaps forcing a small limit to the number of tracked changes could help.

I’m somewhat skeptical about this proposal, on a few fronts:

  • I believe that generally speaking, we have a strong preference to have separate analysis and transformation phases, instead of interleaving analysis and transforms. This is more robust and, if done right, also tends to be more powerful. Of course, doing so is not always feasible (without disproportionate effort), so there is certainly room for this kind of checkpointing approach. However, I’m not convinced that such cases are common enough that an intrusive core IR mechanism for this is justified.
  • We have a few pieces of code that already use a form of “checkpointing”, but these cases are pretty much only interested in deleting newly inserted instructions if the transform is aborted. This is pretty simple to do via IRBuilder callbacks, and we could easily add a slightly abstracted wrapper to further simplify this. It sounds like your proposal is a lot more ambitious, covering everything from instruction insertions, to operand and attribute changes, and (just going by your recent patches) changes to globals. It’s not really obvious to me where we’re going to need this level of change tracking.
  • I’m not convinced that this is going to have negligible compile-time impact. It sounds like we’ll have to introduce checks in quite a few places. Is there a patch that can be tested? I wouldn’t consider the 0.5% regression you name negligible either, for functionality that could be implemented in a free way if it does not hook into core IR.
  • You already raise the concern, but given that this tries to track all IR changes, it seems easy to fail to account for some detail, which will only manifest in rare circumstances. From past experience, just keeping IR state comparison (haveSameSpecialState and friends) as well as MergeFunctions (FunctionComparator) synchronized with IR changes has been a challenge, with omissions regularly being found months later.

Overall, I think that some form of “checkpointing” infrastructure may be useful, but I don’t think it should have any core IR integration. If all checkpointed changes are funneled through a specific class (presumably some IRBuilder wrapper), this gives a way to rollback changes without core IR support, which means you only pay for it when you use it, and it can be limited in the changes it supports to only the parts that are actually useful to somebody.

2 Likes

+1

We may want to discuss the feature in the context of an intended target application and see if the IRBuilder wrapper based alternative works or not.

I believe that generally speaking, we have a strong preference to have separate analysis and transformation phases, instead of interleaving analysis and transforms. This is more robust and, if done right, also tends to be more powerful.

Using checkpointing does not move away from this pattern at all. You can still rely on separate analysis and transformation phases. What it does help you with is avoiding designs that rely on maintaining a custom-made IR-like abstraction on top of actual IR, which not only creates a lot of duplication but is also very error-prone as you need to synchronize them constantly, which often fails causing all sort of issues.

However, I’m not convinced that such cases are common enough that an intrusive core IR mechanism for this is justified.

They may not be that common, but there are a couple of cases where that would simplify the design and accuracy of a transformation quite considerably. Consider for example the vectorizers and their model of the vector code which is always lacking against the actual IR, and needs to be constantly maintained to be in sync with it. Then there is the cost modeling aspect of it which could be improved if we considered actual IR where all the instruction features and dependencies are visible and not have to use a poorly modeled abstractions of it.

It sounds like your proposal is a lot more ambitious, covering everything from instruction insertions, to operand and attribute changes, and (just going by your recent patches) changes to globals. It’s not really obvious to me where we’re going to need this level of change tracking.

Well, an ideal checkpointing scheme would be able to rollback everything, which of course is not always feasible. It is actually not very hard to support a very wide range of changes, but I think it would be very useful if it could revert local changes like deletion/creation of instructions and BBs, operand updates, name changes, in general the most common things that a transformation pass may do on local IR.

I’m not convinced that this is going to have negligible compile-time impact. It sounds like we’ll have to introduce checks in quite a few places. Is there a patch that can be tested?

Yes, I can make it available on github and have it measured.

You already raise the concern, but given that this tries to track all IR changes, it seems easy to fail to account for some detail, which will only manifest in rare circumstances.

I agree, it is indeed easy to fail, which is why aiming at tracking and rolling back as many things as possible make it a lot easier to test.

Overall, I think that some form of “checkpointing” infrastructure may be useful, but I don’t think it should have any core IR integration. If all checkpointed changes are funneled through a specific class (presumably some IRBuilder wrapper), this gives a way to rollback changes without core IR support, which means you only pay for it when you use it, and it can be limited in the changes it supports to only the parts that are actually useful to somebody.

It would be nice to be able to support this without core IR integration, but tracking basic things like operand changes needs to be done in the core IR classes.

We may want to discuss the feature in the context of an intended target application and see if the IRBuilder wrapper based alternative works or not.

I doubt that the IRBuilder would be adequate, but something similar would be implementing checkpointing as a class with functions that modify the IR and track it at the same time, something like: ChkpntTracker.setOperand(I, OpIdx, Operand), ChkpntTracker.replaceAllUsesWith(V1, V2), ChkpntTracker.createInstruction(), Chkpnt.Tracker.moveInstructionBefore(I, BeforeI) etc.
There are a couple of issues with this:

  • You can’t use existing LLVM code, you have to rewrite the code specifically with checkpointing in mind. A very useful feature of integrating checkpointing in the IR is that you can support LLVM’s existing functions, like for example you could call llvm::InlineFuncttion() and then rollback if needed without having to write a Checkpoint-specific form of that code.
  • This also makes it a lot harder to test and verify, as we would not be able to use existing code or passes for testing.
  • There is a lot of duplication: these functions will have to duplicate a significant part of the LLVM IR API. Keeping them from diverging will be very tricky.
  • Tracking has to be done at a relatively high level, so instead of relying on tracking the changes at the most efficient point we would have to implement a large number of high-level functions instead. For example, since we won’t be able to track changes to the Use edge, we would have to implement all functions that result in changes of the Use edge, including setOperand, RAUW, RUOW, etc.

Without Core IR support, you force coupling between every single source of IR modification and the “listener” of these modifications, I don’t understand why such a coupling in the design would be desirable, it seems much more restrictive and is unlikely to provide the feature that this proposal does.

I agree that overhead of LLVM should be looked at very closely, but that’s a somehow different question.

The vectorizer is moving to VPlan, which uses its own multi-versioning system (VRecipes) and would not profit from this concept, since the commit/rollback is already a core part of the VPlan design.

The two main ways I can see this working is to either have hooks on every IRBuilder API entry-point to hold both old and new state (stacking for changes on the same area) or to clone the entire IR, neither of them have negligible impact. This is, of course, naive implementations, but you’re bound to have to carry around the old and the new in some form.

What do you mean by “LLVM code”? The actual compiler code or LLVM bitcode?

The vectorizer is moving to VPlan, which uses its own multi-versioning system (VRecipes) and would not profit from this concept, since the commit/rollback is already a core part of the VPlan design.

I was thinking more about the SLP vectorizer. But yeah, I think VPlan is a great example of why checkpointing is useful: if checkpointing was around there wouldn’t be a need for a custom-IR-looking-abstraction-on-top-of-LLVM-IR to get the commit/rollback functionality.

The two main ways I can see this working is to either have hooks on every IRBuilder API entry-point to hold both old and new state

Hmm wouldn’t that restrict tracking to only the creation of instructions, types etc. ? How would you track an operand change by just adding hooks to the IRBuilder ?

or to clone the entire IR,

I think cloning the IR is impractical, not just because of the performance hit, but also your data structures that point to the IR will go stale on a rollback.

What do you mean by “LLVM code”? The actual compiler code or LLVM bitcode?

I mean the c++ code in llvm functions/passes. For example, calling any existing helper function (like function inlining) after a call to Chkpnt.save() gets tracked and can be rolled back if needed.

What about VPlan for SLP?

SLP should end up as a separate recipe, but that’s an orthogonal discussion.

Though, to be clear, VPlan isn’t a checkpointing example. There is no rollback or commit at all, and that’s by design, not because we didn’t have checkpoints. Running VPlans as checkpoints would be an inferior implementation.

VPlans allow for an efficient search space before even attempting to modify the IR, something that LLVM IR would never satisfy, especially one with checkpoint in mind, with external tracking and non-trivial relationship.

Exactly, this is the “doomsday” scenario. :slight_smile:

Sorry, I didn’t mean the IRBuilder class, I mean “IR building APIs”. Any rewrite, builder, replace all users, etc. Anything that modifies IR in any way. Missing one of those means you rollback to a different place, and end up with probably broken IR. That’s a lot of changes…

A checkpoint/rollback scheme for SLP would be great to explore alternatives.

I agree, but you can’t always do all the cost/benefit analysis beforehand, and for those cases a checkpointing scheme could be a design option.

Yeah this is the approach I followed in the prototype, and it turns out there aren’t too many places where the IR gets modified.

1 Like

To be honest, IR isn’t good for that either… :confused: We used IR in the original vectorizer and we ended up with too many hacky heuristics.

Now I’m intrigued. :slight_smile:

I just uploaded the Checkpoint prototype patch along with a squashed patch for 4 NFC patches here (perf/checkpoint branch): GitHub - vptest1/llvm-project at perf/checkpoint
@nikic could you please add this to the tracker so we can get an accurate measurement of the compilation time overhead?

@rengolin Each hook is guarded by if (LLVM_UNLIKELY(Chkpnt, so as you can see here: IR Checkpointing prototype · llvm/llvm-project@0e85511 · GitHub you don’t need a huge number of hooks to track the IR state. There are currently 95 such places across several files, and this covers a very wide range of changes, including Module-level stuff, function attributes etc. Please check out unittests/IR/CheckpointTest.cpp for the kind of changes that are currently being supported by the patch. This does not mean that we should not reduce the scope to something smaller, like for example local changes, and if we do so then we would need a much smaller number of hooks.

Just had a look and what worries me more is not the number of such changes, but the requirement of everything being opt-in. This mean whatever we forget to add, won’t be rolled back and we won’t notice until someone encounters a crazy bug.

There are a number of concerns that are raised with this approach, some commented here already. The summary from me (and I may be missing things) are:

  1. Being opt-in means we need to know beforehand everything that needs adding or risk being left in an unstable state after a rollback.
  2. Not tracking analyses means the pass manager will have to also track rollbacks and re-run the analysis again, even if the pass hadn’t change the IR at all, in case the “unchanged IR” is actually the result of a change+rollback.
  3. Not tracking external structures (especially ValueHandles) means existing instrumentation may stop working or get to an unstable state. Enough passes use those structures to be left as “an exercise to the reader”.
  4. Even if we cover all upstream cases, downstream users could be in a state where they’ll have to implement rollback support for all of their passes, regardless of their own usage, if there’s a chance some future (upstream) pass can roll back stuff and break the downstream information.
  5. Not keeping operand order may not play well with canonicalization, which means the pass manager will also have to run it again if there has been any rollbacks.

I’m not too worried about constants, because they can normally be easily cleaned up, but that also means we need to make sure we run enough cleanups at the end of the pipeline, after all possibilities of roll backs have gone past.

This is indeed a concern, we want to be able to add, for example, another function attribute without having to worry about breaking checkpointing (or even worse breaking it without realizing it). I think the only way to deal with this is to make sure changes to such datastructures are funneled through centralized points that cannot be bypassed, and making sure that tracking is done there. See for example ⚙ D144040 [NFC][IR] Force accesses to Function attributes go through getters/setters . This won’t protect us from larger-scale changes though, but these may be destructive enough to break existing tests.

Why would the pass manager need to know about rollbacks? Shouldn’t it be the responsibility of the pass using checkpointing to do so if needed?

This is true, you can’t just go ahead and use whatever existing code is available, which is why the scope of checkpointing needs to be restricted. Regarding ValueHandles, and any limitations that may be out of scope, I think we need to warn the user by triggering a crash.

In general the user’s datastructures won’t be affected because checkpointing maintains the pointers. The only structures that will missbehave are those tracking the destruction of objects. I don’t think there is a way around it in any checkpointing scheme, so that would have to be clearly documented so that the user is warned about it. I don’t think though that this is a major concern.

We have two ways to make sure we don’t end up in an unstable state:

  • Using expensive checks that compare the IR after rollback against the saved one and warn the user if it they differ, and
  • Triggering assertions when a feature is about to be used that is not being handled, like in such things like ValueHandles.

I don’t understand this point. Why would downstream users need to implement rollback support? If they have made changes to the IR core and have decided not to use checkpointing in their own passes, they won’t need to support it. Checkpointing is off by default.

This can be fixed. Checkpointing just needs to track the index of the use and make sure it gets inserted at the right place on rollback. This may be expensive, but the user could opt-out if needed. Checkpointing is already doing similar tracking for other changes, so that things remain in their original order.