Range-based solver and eager symbol garbage-collection

I’ve just submitted a PR for improving the range-based solver for CSA.
And in the solver-related tests we have a bunch of explicit void casts at the end.
The intent is to keep constraints artificially alive by explicitly mentioning variables tied to those constraints.
This isn’t too realistic, and it should just work without those void-casts.

Like in my PR, I also had a void cast:

void gh_62215(int x, int y, int z) {
  if (x != y) return; // x == y
  if (z <= x) return; // z > x
  if (z >= y) return; // z < y
  clang_analyzer_warnIfReached(); // no-warning: This should be dead code.
  (void)(x + y + z); // keep the constraints alive.
}

However, if I’d remove that void-cast, the statement would be surprisingly reachable.

It is because if a member is alive in an eqclass, that won’t prevent the rest of the symbols being garbage-collected.
Consequently, after the if (z <= x) return;, x gets garbage-collected and we lose x == y and z > x constraints; so when we encounter if (z >= y) return;, we have basically an empty constraints map with an empty eqclass mapping.

This eager garbage-collection problem and losing constraints is directly tied back to #60836, and why we have a bunch of void-casts in the tests artificially keeping such constraints alive.

Do you agree that it shouldn’t affect the analysis weather there is a void-cast at the end or not. The exploded graph should look exactly the same regardless.
@NoQ @Xazax-hun @DonatNagyE


Note that the example would behave as described, if the mentioned PR is applied; but there are other examples in the test suite already.

I absolutely agree that this overly eager garbage collection is not intended.

I don’t think garbage collection for relational constraint data structures has ever been solved.

It’s quite a non-trivial problem.

For example, if you have constraints $x >= $y and $y >= $z and then the symbol $y dies, then both of these constraints become meaningless in isolation. However, if we collect both of them, we accidentally lose a precious bit of information $x >= $z that immediately followed from them and didn’t have anything to do with $y.

In other words, the correct result of garbage collection of $y on the system of constraints

$x > $y
$y > $z

is a single constraint

∃$y ($x > $y && $y > $z)

which I don’t think we’ve ever implemented (or even know how to describe).

To me, at first glance, a formula ∃$y ($x > $y && $y > $z) should take as much storage as the individual operands within ($x > $y and $y > $z), which makes me question if we would really gain something over simply “leaking” and keeping the constraints for dead symbols.

Do you think it would make sense to experiment with keeping swapping the RangeConstraintManager::removeDeadBindings to a noop? - and leak.

EDIT: I thought about doing a noop there, and that’s not actually what I want.
Rather, only remove constraints if none of its sub-symbols are alive.
In other words remove $x > $y only if $x and $y become dead.
Also for eqclasses: drop an eqclass if none of the class members are alive.

Rather, only remove constraints if none of its sub-symbols are alive.

I strongly support this change. Symbols should be kept alive if they tell us something about other symbols that are still relevant.

Moreover, I suspect that aggressive cleanup of symbols may be a premature optimization with limited effects. Are there any for-each-existing-symbol loops that significantly contribute to the CSA runtime?

I have no numbers, but I would suspect that it significantly contributes to the probability of merging states (whatever unlikely that is).

Probably not, unless you also start simplifying ∃$y ($x > $y && $y > $z) to $x > $z + 1.

My intuition is, it’s probably ok to “leak” state items this way temporarily, as long as each of them gets cleaned up eventually, when associated code is completely forgotten. But if it turns out that some of these constraints are completely unremovable, even when all symbols are dead, then we’re probably in trouble.

(It’s just intuition though. A while ago I experimented with disabling dead symbol collection entirely and noticed a significant performance hit. But I’ve never experimented with individual traits. Maybe constraints aren’t that big of a deal anyway and it’s ok to never garbage-collect them? It’s not like their presence impacts any future calculations, this is entirely about performance.)

Consider

$w >= $x
$x >= $y
$y >= $z

where both $x and $y die.

Even though both symbols in $x >= $y are dead, simplifying this system of constraints to

$w >= $x
$y >= $z

would still throw away too much, because this way we can no longer infer that $w >= $z which was obvious initially.

So this has to be some sort of transitive closure of all constraints in which at least one symbol is alive.

And we aren’t allowed to simply say “well, $x happens to participate in a live constraint, therefore $x is alive”. This would cause constraints to become “material”, impacting future analysis (eg. prevent us from finding memory leaks associated with $x).

(To be fair, equality constraints such as $x == $y should probably be treated as “material” in this sense. Just because $x goes out of scope, doesn’t mean we can’t recover its value from $y. But such equality constraints are a special case for a million other reasons as well, and I’m fairly confident that simply sinking the analysis whenever it encounters such constraints may be the best course of action.)

I see. The problem there is that there are large set (infinite?) of constrant combinations that we would need to fold, so I don’t think it’s feasible in practice.

I agree.

Yea, I see. What @Xazax-hun raised was that we might actually want to do some canonicalization before we add a constraint to the system (and when merging eqclasses). The idea is to replace equivalent symbols with some stable symbol which could be the representative of an eqclass. We would need to do this “replacement” canonicalization at all sub-symbols part of a SymExpr. The same “canonicalization” would be required before performing queries, to get hits in the lookups. But I haven’t thought this through, but it’s possible that infer could be taught to handle this case for lookups. However, this approach would need to keep the representatives of eqclasses alive.

My intention was to not mark these symbols artificially alive, but rather not remove them from the maps. But on second thought, it might raise some lifetime problems.
(but we are using BumpPtrAllocators right?)

I’m not sure if I understood this.
If the symbols are equal, then this equality might be relied upon in code. (except for pointer values of course, because they carry provenance)

Does “material” mean keeping it forever? I was considering the opposite for equations.

Suppose we have $x == $y and $x == $z as constraints in a state and we want to garbage collect $x for the state. Can we do this?

  • First substitute $x with $y for every $x in the state. Then we will have $x == $x and $y == $z in the constraints and no $x in other places of the new state after substitution. The new state is equivalent to the old one.
  • Then we can garbage collect $x as well as $x == $x with satisfactory.

Basically, it is a simplification process for equations.