[RFC] Lightweight LLVM IR Checkpointing

vporpo · March 1, 2023, 2:05am

I think the main concerns have to do with (please correct me if I am wrong):

Checkpointing coverage, and whether supporting a reduced set of the IR is a viable approach.
- I think not supporting 100% of the IR is inevitable, at least in the beginning. The scenario that highlights the problem that could be caused by this is the following: Consider a pass that calls an external function foo() within a code region where checkpointing is active. If at some point foo() is modified by a developer who is not aware of the limitations of checkpointing, then this change can cause a crash.
  I don’t think that we have a good way to completely avoid this problem, but we can mitigate it by having a good-enough coverage that makes this less likely to happen.
Adding new functionality to the IR and forgetting to support it in checkpointing.
- We cannot guarantee that this will never happen. The tools that we have to mitigate it are: (i) refactoring the code that is likely to change so that accesses are funneled through centralized APIs, and (ii) relying on existing tests to catch any such change.
ValueHandles
- I think the safest option is to prohibit their use while checkpointing is active, by triggering a crash if there are any registered listeners. The reasoning is that their listeners may not have state that can be safely reverted, so a rollback may result in a listener’s state that is different from what is expected. Down the road we may enable their use for specific analyses with state that can be reversed.
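
To make the last two points concrete, here is a minimal sketch of what "funneling accesses through a centralized API" and "refusing to checkpoint while listeners are registered" could look like. It is a toy model, not LLVM code: `ToyModule`, `setName`, `registerListener`, `save`, `rollback` and `accept` are all hypothetical names, and the sketch throws an exception where the proposal would trigger a crash, only so that the behavior is easy to observe.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an IR container. None of these names are LLVM APIs.
class ToyModule {
public:
  explicit ToyModule(std::vector<std::string> InitialNames)
      : Names(std::move(InitialNames)) {}

  // The single mutation entry point. Because every rename goes through here,
  // an active checkpoint cannot miss a change.
  void setName(std::size_t Idx, std::string NewName) {
    std::string &Slot = Names.at(Idx); // throws std::out_of_range on a bad index
    if (Tracking)
      UndoLog.emplace_back(Idx, Slot); // remember the old value
    Slot = std::move(NewName);
  }

  const std::string &getName(std::size_t Idx) const { return Names.at(Idx); }

  // Hypothetical bookkeeping for ValueHandle-like listeners.
  void registerListener() { ++NumListeners; }
  void unregisterListener() {
    if (NumListeners > 0)
      --NumListeners;
  }

  // Start tracking changes. Refuses to start while listeners are registered,
  // because their state could not be reverted by rollback().
  void save() {
    if (NumListeners != 0)
      throw std::logic_error("checkpointing with registered listeners");
    UndoLog.clear();
    Tracking = true;
  }

  // Undo all tracked changes in reverse order.
  void rollback() {
    assert(Tracking && "rollback() called without save()");
    for (auto It = UndoLog.rbegin(); It != UndoLog.rend(); ++It)
      Names[It->first] = It->second;
    UndoLog.clear();
    Tracking = false;
  }

  // Keep the changes and stop tracking.
  void accept() {
    UndoLog.clear();
    Tracking = false;
  }

private:
  std::vector<std::string> Names;
  std::vector<std::pair<std::size_t, std::string>> UndoLog;
  unsigned NumListeners = 0;
  bool Tracking = false;
};
```

In this sketch, a pass would call `save()`, mutate only through `setName()`, and then call either `rollback()` or `accept()`. Any write that bypasses `setName()` is exactly the kind of untracked change the coverage concern above is about.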

davidxl · March 1, 2023, 4:25am

For the first issue, how different is it from other ways of creating illegal IR when writing a transformation pass? Problems like this are usually caught during the development stage. The fact that it will crash, either in the compiler or at runtime, is a clear indicator and makes debugging easier. Of course, adding verification tools would be useful here.

The second issue does not sound like a new issue either. For instance, it is quite common for newly introduced code to fail to update existing metadata. Again, improved tooling is desirable.

vporpo · March 1, 2023, 8:37pm

This can be caught by checkpoint’s verifier checks that compare the rolled back IR against the saved one. So it will be a compiler crash, similar to an IR verifier crash triggered by the caller of that function during rollback().
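
As a rough illustration of that verifier check, the sketch below snapshots a textual dump when the checkpoint is created and compares it against the dump after rollback. Everything here is a toy model under stated assumptions: `ToyFunction`, `dump`, `ToyCheckpoint` and `replace` are hypothetical names, the "IR" is just a list of strings, and `rollback()` returns `false` where the real check would crash the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a function body. Not an LLVM type.
struct ToyFunction {
  std::vector<std::string> Instrs;
};

// Textual dump used as the comparison key for the verifier check.
std::string dump(const ToyFunction &F) {
  std::string Out;
  for (const std::string &I : F.Instrs) {
    Out += I;
    Out += '\n';
  }
  return Out;
}

class ToyCheckpoint {
public:
  // Snapshot the dump at the point where checkpointing starts.
  explicit ToyCheckpoint(ToyFunction &F) : F(F), SavedDump(dump(F)) {}

  // Tracked mutation: records the old text so rollback() can restore it.
  void replace(std::size_t Idx, std::string NewInstr) {
    assert(Idx < F.Instrs.size() && "instruction index out of range");
    UndoLog.emplace_back(Idx, F.Instrs[Idx]);
    F.Instrs[Idx] = std::move(NewInstr);
  }

  // Undo tracked changes in reverse order, then compare against the snapshot.
  // Returns false if the rolled-back state differs from the saved one.
  bool rollback() {
    for (auto It = UndoLog.rbegin(); It != UndoLog.rend(); ++It)
      F.Instrs[It->first] = It->second;
    UndoLog.clear();
    return dump(F) == SavedDump;
  }

private:
  ToyFunction &F;
  std::string SavedDump;
  std::vector<std::pair<std::size_t, std::string>> UndoLog;
};
```

If some callee modifies `Instrs` directly instead of going through `replace()`, the undo log is incomplete and the comparison in `rollback()` fails, which is how an untracked change would surface as a verifier failure rather than as silently wrong IR.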

Yes, this is quite a common problem when trying to keep two independent entities in sync, and there is usually no great solution for it.

vporpo · March 8, 2023, 6:22pm

I just sent out an RFC about an alternative local scheme: [RFC] Local Per-Component IR Checkpointing. Its design is quite similar, but I think it has a lot of advantages compared to this one. Please let me know what you think.
