[RFC] Lightweight LLVM IR Checkpointing

I think the main concerns have to do with (please correct me if I am wrong):

  • Checkpointing coverage, and whether supporting a reduced set of the IR is a viable approach.
    • I think not supporting 100% of the IR is inevitable, at least in the beginning. The scenario that highlights the problem that could be caused by this is the following: Consider a pass that calls an external function foo() within a code region where checkpointing is active. If at some point foo() is modified by a developer who is not aware of the limitations of checkpointing, then this change can cause a crash.
      I don’t think that we have a good way to completely avoid this problem, but we can mitigate it by having a good-enough coverage that makes this less likely to happen.
  • Adding new functionality to the IR and forgetting to support it in checkpointing.
    • We cannot guarantee that this won’t happen ever. The tools that we have to mitigate this are: (i) Refactoring the code that is likely to change such that accesses are funneled through centralized APIs, and (ii) Relying on existing tests to catch any such change.
  • ValueHandles
    • I think the safest option is to prohibit their use while checkpointing is active, by triggering a crash if there are any registered listeners. The reasoning is that their listeners may not have state that can be safely reverted, so a rollback may result in a listener’s state that is different from what is expected. Down the road we may enable their use for specific analyses with state that can be reversed.
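
To make mitigation (i) and the ValueHandle restriction concrete, here is a minimal sketch using only the standard library. It is not the RFC's implementation and uses no LLVM API: `ToyInst`, `ToyCheckpoint`, `setOperand()`, `save()`, `accept()` and the listener count are all hypothetical names chosen for illustration, and the sketch throws an exception where the proposal would crash the compiler.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for an IR instruction; not an LLVM class.
struct ToyInst {
  std::string Opcode;
  std::vector<int> Operands;
};

// Hypothetical tracker: records one undo action per tracked mutation.
class ToyCheckpoint {
  std::vector<std::function<void()>> UndoLog;
  bool Active = false;

public:
  // Start tracking. Refuses to start if any listener (standing in for a
  // ValueHandle) is registered, because its state could not be reverted.
  // Assumption: a real implementation would crash here instead of throwing.
  void save(int NumRegisteredListeners) {
    if (NumRegisteredListeners != 0)
      throw std::logic_error("listeners registered while checkpointing");
    UndoLog.clear();
    Active = true;
  }

  // Centralized mutation API: every operand write is funneled through here,
  // so the old value is always logged before it is overwritten.
  void setOperand(ToyInst &I, unsigned Idx, int NewVal) {
    int OldVal = I.Operands.at(Idx); // throws std::out_of_range on a bad index
    if (Active)
      UndoLog.push_back([&I, Idx, OldVal] { I.Operands[Idx] = OldVal; });
    I.Operands[Idx] = NewVal;
  }

  // Replay the undo actions newest-first, then stop tracking.
  void rollback() {
    for (auto It = UndoLog.rbegin(); It != UndoLog.rend(); ++It)
      (*It)();
    UndoLog.clear();
    Active = false;
  }

  // Keep the changes and drop the log.
  void accept() {
    UndoLog.clear();
    Active = false;
  }
};
```

A write that bypasses `setOperand()` and assigns to `I.Operands` directly is never logged, which is the toy equivalent of the foo() scenario above: the more accesses go through the one tracked entry point, the less likely such a change slips in unnoticed.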

For the first issue, how different is it from other ways of creating illegal IR when writing a transformation pass? Problems like this are usually caught during development. The fact that it will crash the compiler, or crash at runtime, is a clear indicator that makes debugging easier. Of course, adding verification tools will be useful here.

The second issue does not sound like a new issue either. For instance, it is quite common for newly introduced code to fail to update existing metadata. Again, improved tooling is desirable.


This can be caught by the checkpoint’s verifier checks, which compare the rolled-back IR against the saved one. So it will be a compiler crash, similar to an IR verifier crash, triggered by the caller of that function during rollback().
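
As a rough illustration of that kind of check, the sketch below saves a textual dump of a toy function and compares it against the dump taken after rollback. It assumes the comparison is done on printed IR, which may not be how the actual verifier works, and `ToyFunction`, `dump()` and `ToyRollbackVerifier` are hypothetical names rather than LLVM classes.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a function body; not an LLVM class.
struct ToyFunction {
  std::vector<std::string> Insts;
};

// Print the toy IR as text, one instruction per line.
std::string dump(const ToyFunction &F) {
  std::ostringstream OS;
  for (const std::string &I : F.Insts)
    OS << I << '\n';
  return OS.str();
}

// Hypothetical verifier: remembers the text at save time and compares it
// with the text produced after rollback.
class ToyRollbackVerifier {
  std::string Saved;

public:
  void save(const ToyFunction &F) { Saved = dump(F); }

  // True if the current state prints exactly like the saved state.
  bool matches(const ToyFunction &F) const { return dump(F) == Saved; }
};
```

In a compiler, a failed `matches()` after rollback would be turned into a crash (for example through an assertion), so an untracked modification shows up at the rollback site rather than as silently wrong IR later on.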

Yes, this is quite a common problem when trying to keep two independent entities in sync, and there is usually no great solution for it either.

I just sent out an RFC about an alternative local scheme: [RFC] Local Per-Component IR Checkpointing. Its design is quite similar, but I think it has a lot of advantages compared to this one. Please let me know what you think.