Expose the inferred range information in warning messages

Clang Static Analyzer follows the execution of the analyzed code by maintaining a mapping from symbols [1] to RangeSets that represent the potential values. (I know that we also have Z3, but if I understand correctly standalone Z3 does not work in practice.)

There are some checkers (e.g. ArrayBound checking) where it would be useful if the error message reported the inferred range of certain values (e.g. the index causing the overflow). However, the ConstraintManager interface is really restrictive, and it doesnā€™t provide any public method that would expose something like RangeConstraintManager::getRange. In addition to performing assumptions, the checkers can and do extract concrete values when some symbol happens to be constrained to a single value, but otherwise they are limited to awkwardly phrased, cryptic messages.

As a concrete example, after my recent fixes alpha.security.ArrayBoundV2 can emit messages like

Access of ā€˜arrayā€™ at index 11, while it holds only 10 ā€˜intā€™ elements

in cases when the concrete index is known, but if we only have an assumption like idx >= 12 then it cannot say more than

Access of ā€˜arrayā€™ at an overflowing index, while it holds only 10 ā€˜intā€™ elements

I would like to expose a getRange()-like method in the ConstraintManager interface to be able to produce better error messages like

Access of ā€˜arrayā€™ at index >= 12, while it holds only 10 ā€˜intā€™ elements

To remain compatible with the (mostly theoretical) non-range-based constraint managers, I could allow nullptr/nullopt as a ā€œrange information not supportedā€ dummy return value from the public getRange() method.

I know that the individual assumptions of the checkers are also reported as note tags, but the bug path is very noisy (it can contain dozens of irrelevant ifs) and itā€™s common that the final inferred range is composed from several elementary assumptions, so I think itā€™s important to report the details of the ā€œfinal verdictā€ to the user (in addition to the bug path).

@NoQ @Xazax-hun @steakhal What do you think about this change? May I add a method that approximately looks like RangeSet getRange(ProgramStateRef State, SymbolRef Sym) to the ConstraintManager interface? Do you happen to know about any invariants, design goals, pitfalls, tricky details etc. in this area that I should pay attention to?

[1] More precisely: equivalence classes of symbols that are known to be equal at that point.

Iā€™m in favor of exposing the associated ranges.
In fact, downstream we already do that.

Itā€™s going to tell you the actual assumptions, but not how they were acquired.
The most tricky scenario Iā€™ve faced was in the presence of aliasing and inheriting constraints from equivalent class members. Unfortunately, I could not come up with a good example, but here is a bad one: Compiler Explorer

void clang_analyzer_eval(bool);
void clang_analyzer_value(int);
void top(int x, int y) {
  if (x != y) return; // x == y
  if (x < 10) return; // x >= 10
  if (y > 20) return; // y <= 20
  //clang_analyzer_value(x); // x: [10,20]
  int buf[5] = {};
  buf[x] = 404;
}

Notice that the x == y assumption is the first one, thus by reading the bug report backward, (as recommended to read), itā€™s not evident that x and y are related - only after reading the bug report for the second time.

Consequently, exposing solver APIs might offer improvement opportunities even on the bug report visitors side as well.


Have you considered only exposing the stringified constraints, instead of the RangeSet associated with a symbol? I fear, without the right abstractions/utilities these RangeSets arenā€™t really useful for checkers if each checker manually needs to construct meaningful messages from them.

In fact, downstream we already do that.

If you already have a working implementation for this, then perhaps you should release it because then (1) instead of reimplementing it, I could work on something that would be new (and perhaps useful for you) and (2) you wouldnā€™t need to mange conflicts between the upstream and downstream variants when youā€™re incorporating changes from the open source.

Have you considered only exposing the stringified constraints, instead of the RangeSet associated with a symbol? I fear, without the right abstractions/utilities these RangeSets arenā€™t really useful for checkers if each checker manually needs to construct meaningful messages from them.

RangeSet has convenient getMinValue() / getMaxValue() methods and I was planning to use that when Iā€™m constructing the messages. I think this would be more flexible and useful than a ā€œfrozenā€ string representation, e.g. when Iā€™m reporting an overflow, Iā€™d probably only mention the lower bound.

Well, we exposed the range sets for our alternative bug reporting mechanism. We also have an alternative implementation to the OOBV2 checker that uses some range information for messages if I recall, and our checker also has its own problems thus we use it along with V2. So in short, I donā€™t have anything directly useful for upstream, and Iā€™m fine having merge conflicts. To be clear, other than time and effort, Iā€™m not against upstreaming.

I agree, Iā€™m only advocating for having utilities for consuming the range set in a limited but useful way, e.g. for constructing the message you are interested. (In addition to exposing the raw RangeSet)

Ok, thanks for the clarifications. Itā€™s clear that upstreaming is not worth it when the projects are already divergent in other ways as well.

I agree that if there will be multiple checkers that incorporate RangeSet data into their reports, then they should probably use common helper functions to construct the messages, but I donā€™t want to design this in advance because I canā€™t predict what will be relevant.

1 Like

Iā€™ve always been in favor of this! Adding notes such as ā€œ[Tracked value] constrained to [5, 6] U {8} U [11, 12]ā€ makes a lot of sense to me.

But at the same time in an offline conversation @ziqingluo-90 came up with a much more principled theoretical solution to explaining the internal logic of the constraint solver in path notes, and I really hope he gets some time to share it :sob: It theoretically allows us to highlight the precise constraints that lead to the bug condition, ignoring the irrelevant constraints, and it works regardless of the internal implementation of the constraint solver, so itā€™d work equally well with like Z3.

As an initial concrete step I created the PR [analyzer] Let the checkers query upper and lower bounds on symbols by DonatNagyE Ā· Pull Request #74141 Ā· llvm/llvm-project Ā· GitHub which provides a basic API and a proof-of-concept application of it in the checker core.BitwiseShift.

Although I originally planned to expose a RangeSet value, I ended up only exposing the minimal and maximal possible values, because those were sufficient for this commit (and they seem to be sufficient for my plans in ArrayBoundV2 as well) and I wouldā€™ve had to move RangeSet into a separate header file to be able to use it as a the return type of a method of ConstraintManager.

Of course, later we can expose the RangeSet if thereā€™ll be a situation where itā€™ll be needed.

a much more principled theoretical solution to explaining the internal logic of the constraint solver in path notes, and I really hope he gets some time to share it

That sounds interesting and would be a big step forward, but I think that this change (reporting the final conclusions of the checker) would be relevant even if we could report all the important individual assumptions (without noise).

(I merged the commit mentioned in my previous comment. Iā€™ll continue with using these tools in ArrayBoundV2 when my pending PRs are resolved there.)

I absolutely agree. Itā€™s valuable to display both the equations as written in the program, and the final solution to these equations.