[RFC][clang][dataflow] SignAnalysis, edgeTransfer, branchTransfer

Hi there,

I am experimenting with Clang’s relatively new dataflow analysis framework to implement a simple sign analysis. I really appreciate the framework as it nicely follows the lattice theory, so it was very easy to advance for the first few test cases.

Then I wrote the following test case, which I wanted to pass in a TDD fashion.

  std::string Code = R"(
    int foo();
    void fun() {
      int a = foo();
      if (a > 0) {
        (void)0;
        (void)1;
        // [[p]]
      }
    }
  )";
  RunDataflow(Code, UnorderedElementsAre(
                        Pair("p", HoldsSignLattice(UnorderedElementsAre(
                                      Pair(Var("a"), Positive())
                                      )))
                        ));

I ran into a problem, that during the transfer function how to tell which branch did we take? Alas, when we analyze the (void)0; statement, it is not trivial to tell that the condition of the if is evaluated as true of false. The only thing I could do is to save the condition expression a < 0 in the previous call to transfer (i.e in the Pred node of the CFG) and then in later tranfer calls I could query if the flow condition implies that condition. This had to be done redundantly for the second statement as well in the very same basic block.

Other dataflow frameworks provide special callbacks for edge transfers in the CFG, one notable example is the Java SpotBugs ReturnPathType analysis (see edgeTransfer). I am proposing here (and request comments) to introduce a similar concept to the Clang dataflow framework. On top of this general concept, I think, a special edge transfer is a “branch” transfer where the callback provides the condition and the boolean value of the condition.

  void branchTransfer(bool Branch, const Stmt *S,
                      SignPropagationLattice &Vars, Environment &Env);

This “transfer” function shall be called once at the beginning of each basic block which has a direct predecessor that is a conditional statement. To demonstrate the idea, I’ve created a prototype patch, please take a look at ⚙ D133698 [clang][dataflow] SignAnalysis, edgeTransfer, branchTransfer

@Xazax-hun @gribozavr @sgatev @ymand @Xazax-hun @NoQ

Thanks for this suggestion! It sounds good in principle – we already have this kind of support built into the environment handling, so it makes sense to expose it to the custom lattices as well. I think the question will be in the details. I will try to take a look at your full proposal by end of week (and probably sooner) and will discuss with others on this thread.

Cheers,

Generally I support adding the transferBranch function, because it exposes functionality that wasn’t customizable in the framework. It would be useful in many cases, however for this specific application I have a better suggestion.

After looking at the code in SignAnalysisTest.cpp I would like to highlight that it is not using some features of the framework that could simplify the implementation, and take advantage of future improvements to the core (like tracking of escaping).

Furthermore, the current implementation of SignPropagationAnalysis::branchTransfer performs case analysis of the branch condition. It does a very simple pattern-matching of a few kinds of expression shapes. That might be OK, but we could do so much more – and that’s already implemented in the framework core, for example, processing logical “and” and “or”, understanding AST nodes related to temporaries etc. Code in branchTransfer that performs case analysis of the whole branch condition would likely have to re-implement a good chunk of that.

The biggest missed opportunity is that the implementation uses VarMapLattice, which tracks the information per-variable. However the framework provides the Value abstraction, which allows you to attach information (in your case – sign) to the values. Then, the framework would track how the value propagates through the program, how it is assigned to other values, and even how it is passed into functions when context-sensitive analysis is enabled. You won’t need to implement any of that; once you attach sign information to a Value, the same sign information becomes available on every copy of that value. In future, when we implement identification and tracking of escaping, the framework core would automatically invalidate the Values that could have changed.

So, here’s an alternative suggestion for the implementation:

  • attach a couple of boolean values to the Value that models integer values, these boolean values should encode the SignLattice.

  • implement transfer functions for integer comparisons, constraining the returned boolean based on the encoded SignLattice in the inputs.

  • implement joining of the new information.

  • now you can query the flow condition for the signedness of integer values.

Admittedly, we don’t have excellent support for this approach, especially for encoding the lattice values as booleans. Primarily, that’s because we only have a simple SAT solver that does not process higher-level data types like lattice elements. So you have to define a mapping from lattice elements to boolean variables yourself.

Nevertheless, you can find an example of this approach here, where we implement pointer nullability analysis this way: crubit/nullability_verification at main · google/crubit · GitHub We attach two boolean variables to each pointer value (is_non_null and is_known) and implement transfer functions for pointer initialization, pointer comparison, and pointer dereferencing. We let the framework core track value copies.

(an example based on a suggestion from @ymand) However, if you were implementing range tracking for integer values, it wouldn’t be as easy to find a high-performance mapping of a range lattice to booleans that we could feed in the SAT solver. In that case, one only has two choices: either a major refactoring of the core framework to replace the SAT solver with an SMT solver, or something like the transferBranch function that you’re proposing, that would allow you to define your own abstractions instead of being forced to use the builtin flow condition.

Thank you for your comments and review @ymand and @gribozavr. Based on your insights, I am going to dissect the referenced patch into two patches:

  1. A simple infrastructural change that introduces transferBranch. As you said, this could be useful for complex lattices (e.g integer value ranges) until we have an SMT solver in the core framework.
  2. SignAnalysis implementation. I am going to rework it by using the boolean Value abstraction you suggest.

So, I had some time to split this into two, and I created a new patch for sign analysis.

  • A simple infrastructural change that introduces transferBranch. As you said, this could be useful for complex lattices (e.g integer value ranges) until we have an SMT solver in the core framework.

https://reviews.llvm.org/D133698

  • SignAnalysis implementation. I am going to rework it by using the boolean Value abstraction you suggest.

https://reviews.llvm.org/D136668