Can you expand on what you mean here? I didn’t follow.
I didn’t address abortBuilderFn before because I don’t quite understand how it is supposed to work, especially without standardizing on i1. Who is providing this callback in the context of this pass?
How can it be decoupled from the op if the op is responsible (through the interface) for building the “invariant checking code”, when there isn’t a standard expectation that an i1 is what you provide to abortBuilderFn?
I think my plan of returning i1s instead of throwing was not as simple as I thought, due to examples like the one Matthias showed, where the result cannot simply be returned without modifying arbitrary containing ops. So I’m giving up on that hope for the interface:
My interpretation of a useful abortBuilderFn is one that does not take any Values created by the generated IR. Instead, the effect of something like `cf.assert %bool, "msg"` would be obtained with:

```mlir
scf.if %bool {
  abortBuilderFn("C++ String Passed To Callback")
}
```
where the implementer of the interface creates some logic and calls a function that creates a non-predicated, throwing Operation. This then prevents a requirement of any given type or any given dialect/instruction. The choice of operation would also be specified by the pipeline or pass operating on that interface which allows the form of throwing-operation to be configurable.
OK I see, there is much more than i1 vs non-i1 at hand here: it forces us to introduce control flow where there isn’t any without the abortBuilderFn, and I’m not sure this is a good idea. Also, “C++ String Passed To Callback” does not directly allow “rich” parameters (along the lines of what is shown in [RFC] Runtime Op Verification - #10 by mehdi_amini above, which really requires a standard “assert” op for lowering purposes).
Regardless of whether we have an abortBuilderFn or directly generate cf.assert (or whatever op the interface impl wants to create), we can extend abortBuilderFn and/or cf.assert.
E.g., if we go with an abortBuilderFn:
```cpp
/// A callback that builds IR to abort program execution.
///
/// An optional error message may be used by a runtime to propagate the error
/// to the user. Additional error message args can be provided to produce more
/// descriptive error messages with runtime information. Message args are
/// referenced with "{}" placeholders inside the error message string. The
/// number of message args must match the number of placeholders.
using AbortBuilderFn = std::function<void(OpBuilder &, Location loc,
                                          StringRef msg,
                                          ValueRange placeholders)>;

/// Generate IR to verify this op at runtime. If runtime verification fails,
/// the generated IR is supposed to abort program execution using IR generated
/// by `abortBuilder`.
void generateRuntimeVerification(OpBuilder &builder, Location loc,
                                 AbortBuilderFn abortBuilder);
```
If an abortBuilderFn impl does not know how to print a Value, it can omit it and leave the {} in the string. (Or we can give it a LogicalResult return value so that it can indicate failure to convert to string representation.)
We can have a default -cf-generate-runtime-verification pass that uses an abortBuilderFn that creates cf.abort ops; with support for some commonly used low-level placeholder types (IntegerType, FloatType in the beginning) in the ControlFlowToLLVMConversion lowering.
A different abortBuilderFn may also choose to drop the error message along with all placeholders.
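For illustration, assuming a cf.assert-based abortBuilderFn that keeps the message but receives no placeholders, the IR generated to verify a cast from `memref<?xf32>` to `memref<10xf32>` might look like the following sketch (names and exact ops are illustrative, not what the pass necessarily emits):

```mlir
%c0 = arith.constant 0 : index
%c10 = arith.constant 10 : index
// Compare the runtime size of dim 0 against the static target size.
%dim = memref.dim %m, %c0 : memref<?xf32>
%ok = arith.cmpi eq, %dim, %c10 : index
cf.assert %ok, "memref.cast: size mismatch in dimension 0"
```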
Side note: I expect op verification IR of complex ops to sometimes contain control flow (even in the i1 version). E.g., when dealing with unranked tensors/memrefs, we may want to generate a for-loop from 0 to rank.
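E.g., a sketch of such control-flow-carrying verification IR for an unranked memref (the per-dimension check shown here is purely illustrative):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
// Loop from 0 to the runtime rank and check each dimension.
%rank = memref.rank %m : memref<*xf32>
scf.for %i = %c0 to %rank step %c1 {
  %dim = memref.dim %m, %i : memref<*xf32>
  %ok = arith.cmpi sge, %dim, %c0 : index
  cf.assert %ok, "invalid dimension size"
}
```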
+1 on adding runtime verification and providing this through means of an interface.
One minor suggestion, would it be possible to add an extra parameter (int or enum) that controls the level of runtime verification somehow (ranging from basic and cheap assertions to pedantic but expensive assertions)?
and then verification can be controlled as follows:

```
-runtime-op-verify           // the default
-runtime-op-verify=basic
-runtime-op-verify=xxxx
...
-runtime-op-verify=pedantic
```
This way, the implementer of the interface can choose to simply ignore the flag if all tests are relatively cheap, or implement a progression of tests that range from e.g. simple O(1) tests under basic to more expensive tests under pedantic. Or, instead of a progressive level, we can come up with a mechanism that controls the kind of checks that are generated (a la ubsan, memsan, asan, …).
I’d like to push a bit further on this idea. Specifically, I’d like to:
Form a set of guidelines regarding what should be verified by runtime verification and what should not.
Explore what kind of runtime verifications are currently missing but would be useful.
What should be verified?
Anything that the static op verifier already verifies, but extended to the dynamic semantics of the operation. Such cases are typically described as “undefined behavior” in the op description. Example: memref.subview does not allow static out-of-bounds slices (if we merge this), so dynamic out-of-bounds slices are something that the runtime op verifier should look for.
Anything else that the op description describes as undefined behavior, if possible with a reasonable amount of effort.
For performance reasons, try to avoid verifying things that are already caught by the static op verifier.
Encourage users to verify on a higher level. E.g., we have runtime verification for linalg structured ops to ensure that the tensors have “compatible” runtime dimensions according to the indexing maps. You could probably also verify that after everything has been lowered to the memref+SCF level (there are no linalg ops anymore), but verification on the linalg level will give more descriptive error messages.
This is what I came up with, do you agree? Is something missing?
Current State of Runtime Verification
We currently have verification for the following ops (some still in review).
memref.assume_alignment: Check for incorrect alignment.
memref.cast: Check for invalid casts (e.g., incorrect rank / dim sizes).
memref.copy: Assert that src/dest have the same shape. Partly implemented.
memref.dim: Check for out-of-bounds dimension.
memref.expand_shape: Verify that the total number of elements does not change.
memref.load/store/atomic_rmw: Check for out-of-bounds access.
memref.subview: Check for out-of-bounds slice.
linalg structured op: Check runtime sizes of operands.
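For instance, for a linalg.matmul on dynamically sized memrefs, the runtime size check boils down to asserting that the contraction dimensions agree. A sketch (the actual generated IR may differ):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
// For C = A * B, the number of columns of A must equal the number of rows of B.
%k0 = memref.dim %A, %c1 : memref<?x?xf32>
%k1 = memref.dim %B, %c0 : memref<?x?xf32>
%ok = arith.cmpi eq, %k0, %k1 : index
cf.assert %ok, "linalg.matmul: contraction dimensions do not match"
```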
I’d like to hear your opinion about other kinds of verification that would be useful. Some of these may not fall under the category of “undefined behavior”, so we would need a way to selectively trigger a certain subset of checks. Some things that came to mind:
arith.divf/divsi/divui/remf/remsi/remui: Check for division-by-zero.
arith.addi/muli/subi: Check for integer overflow if “no unsigned wrap” / “no signed wrap” modifiers are present.
arith.trunci: Check that no non-zero bits are discarded.
memref.alloc/dealloc: Check for memory leaks and double deallocs. I have a prototype that detects those based on extra helper functions in the runtime library.
Verification for most tensor dialect ops to check for out-of-bounds accesses, slices, etc. Same as memref. (Not all tensor IR lowers to memref, sometimes it is vectorized.)
llvm.extractelement/insertelement: Check for out-of-bounds access.
llvm.extractelement/extractvalue: Check for extraction of uninitialized elements. (Not sure if useful and not sure if it can be implemented.)
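As a sketch of the integer overflow check mentioned above, for `arith.addi` with the “no signed wrap” modifier one could recompute the sum in a wider type and compare (illustrative IR, not an existing implementation):

```mlir
// Verify %r = arith.addi %a, %b (nsw) : i32 by redoing the add in i64.
%a64 = arith.extsi %a : i32 to i64
%b64 = arith.extsi %b : i32 to i64
%s64 = arith.addi %a64, %b64 : i64
%r64 = arith.extsi %r : i32 to i64
// If the i32 result sign-extends to the exact i64 sum, no overflow occurred.
%ok = arith.cmpi eq, %s64, %r64 : i64
cf.assert %ok, "arith.addi: signed overflow (nsw violated)"
```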
Is there any functionality that you’d like to have runtime verification for?
I’d also like to understand the best way to interact with the ub dialect and poison. It is my understanding that “executing” poison (e.g., executing a basic block that contains ub.poison) is not a problem by itself. But what if a ub.poison value is passed as an operand to a side-effecting operation?
Would it be desirable to detect cases where a memref.store stores a poisoned value into memory? Or a vector.print printing a poisoned value? Is it even possible to detect this? How would I even know that I’m about to store a poisoned value? The only solution that I can think of: replace poison with the same random number (in the case of integers/floats) throughout the entire program and check for that value before storing to memory. This has false positives, but repeated program executions would be able to detect UB with high certainty.
But what if a ub.poison value is passed as an operand to a side-effecting operation?
If MLIR follows LLVM semantics (which I think it should), then storing or reading poison in memory is valid. It is only UB when branching on a poisoned value (in a cf.cond_br), or in operations that may trigger UB depending on the value. For instance, divsi(poison, nonpoison) is poison, but divsi(nonpoison, poison) is UB. Also, note that poison is quite complex to represent in memory, and it depends on whether the values you store are pointers or just integers (An SMT Encoding of LLVM’s Memory Model for Bounded Translation Validation | SpringerLink).
The only solution that I can think of: replace poison with the same random number (in the case of integers/floats) throughout the entire program and check for that value before storing to memory.
Detecting poison this way would mean that the lowered arith operations would always have to check whether their operands are this specific constant, in order to propagate poison correctly. Also, how would you do that for i1 (which is the type that triggers poison in cf.cond_br), since both 0 and 1 are very likely to appear, leaving no space for a “poison marker”?
The best way I have in mind to deal with poison explicitly is a lowering that compiles iN into i(N+1) or tuple<iN, i1>, so that there is an explicit poison flag.
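For illustration (a hypothetical lowering, nothing that exists upstream), the tuple<iN, i1> encoding could propagate poison through a non-trapping op like arith.addi as follows:

```mlir
// %r = arith.addi %a, %b : i32 under an explicit poison-flag encoding:
// every SSA value becomes a pair (value : i32, poison : i1).
%r_val    = arith.addi %a_val, %b_val : i32
// Poison propagates: the result is poison if either operand is poison.
%r_poison = arith.ori %a_poison, %b_poison : i1
```

A branch lowering would then additionally `cf.assert` on the negated poison bit of the condition before branching, making branch-on-poison detectable at runtime.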
Otherwise, just some notes about the semantics of some operations:
arith.trunci: Check that no non-zero bits are discarded.
AFAIK, this check should only happen when the nuw flag is set.
arith.divf/divsi/divui/remf/remsi/remui: Check for division-by-zero
Division-by-zero is UB for scalar and vector types, but op semantics outside of arith/vector are not explicit about this. Ultimately it’s for each dialect to decide, either through explicit documentation or implicitly via lowering (which would be the case for memref load/store).
+1. It’s the same as with any other sanitizer: quality of the implementation matters, and we want some efficient encoding and the ability to elide checks that are statically known not to fire. The difficulty is probably that you’d have to have multiple dialects agree on some shared instrumentation format / convention.
Big +1 to this. We need more “rationale”-style documentation explaining why we do or do not certain things. This may be tricky to get right, so it is fine to iterate on (and explicitly mention in the documentation that we welcome RFCs to evolve the guidelines). One thing I’d encourage is to also provide examples of things that clearly don’t fit into the runtime verification idea.
A large class of examples is doing a full-fledged abstract interpretation over the IR to detect potential bugs. E.g., propagate ranges across integer operations using the dataflow framework to see if the divisor has zero in its range and warn or something similar.
Note that this assumes the higher level exists at some point in the program. This is not always the case, so we shouldn’t just skip verification at the SCF/memref level because something could have been verified at Linalg level. When we come from CIR, there’s no Linalg, for example. We should, however, skip verification that would be “too complex” by some metric.