[GSoC] Re: CSA constraint solver improvements

Hi everyone,

I am a pre-final year undergraduate in computer science. I am interested
in "Clang Static Analyzer: constraint solver improvements" project.

I have around 8 months of experience with LLVM/Clang during my compilers
class where I also implemented some dataflow analysis techniques for LLVM
IR. I am also in middle of completing my decision procedures course at
university. This course has helped me in gaining fundamental knowledge
about solvers. I am familiar with range-based logic. Along with these, I
have tried z3 while learning about solvers, and I think I can learn more
about it fairly quickly.

I think this project fits for me. It will also help me in improving my
knowledge further. I have started working on my proposal.

Can you give me further directions regarding the proposal for this
project?

Thank you

Good to hear that someone is interested in the SMT Solver part of the analyzer.

Mikhail Ramalho is probably the primary driving factor in this direction, but I try to participate/help as much as I can.
You probably already know, Artem Dergachev is the code owner of the CSA, so I guess, you can count on him as well.

Unfortunately, the SMT solver is not maintained, thus it is in pretty bad shape. It can not pass the tests for various reasons.
I planned to improve the situation, but it was always a low priority for me. My primary concern is the Z3 solver for bugreport refutation for the range-based solver.

I recommend you have a look at patches in the past using git blame.
AFAIK these three are the most important patches which not landed yet:
D83677, D83660, D85528

Regards, Balazs.

Hi Manas,

Great to see some interest for the project!

The solver in question is fully located in RangeConstraintManager.cpp. There we try our best to provide useful ranges for symbolic expressions AND do it real fast. The last part is probably the most important thing here. This solver works every time we see a condition in code and when we tried using z3 for this instead, it degraded performance so drastically that the analyzer is not useable (e.g 10min → 25h).

However, we have found another application for z3 - refute produced warnings. When we are about to report a new warning, we can check that constraints are sat/unsat and discard the warning in the latter case. This happens way less frequently and good in terms of performance. Alas, the majority of users have static analyzer build without z3, and some false positives warnings sneak in.

So, there are two main directions that I see:

  • Figure out cases when z3 refutation works better than the built-in fast solver.
    Run the analyzer in both modes and analyze the difference. Usually it’s under 10 warnings, so it won’t be very tedious.
  • Add reasoning about range-based binary operations.
    If we know ranges for symbols or symbolic expressions x and y, we can often reason about possible ranges for x OP y, where OP is some binary operator. At the moment, we have support for &, |, and % (it’s a bit weird set of operators, but it was driven by reported false positives).

We are open for other suggestions and algorithms in this component, but I guess it’s valuable to start with the first bullet point to have a good motivation for improvements.

@Balazs, sorry for confusion. Here is a short description of the project I suggested for this year: https://llvm.org/OpenProjects.html#static_analyzer_constraint_solver

It’s a logical continuation of my work:
D86465, D82445, D83286, D82381, D80117, D79434, and D79336

Cheers,
Valeriy

Oh my bad. I thought it was about the SMT solver. Now I’m even more engaged if we plan to focus on practical usability and performance instead of doing this for academic research.

Sorry for the confusion I might have caused.

Thanks, Valeriy!

The solver in question is fully located in RangeConstraintManager.cpp. There we try our best to provide useful ranges for symbolic expressions AND do it real fast. The last part is probably the most important thing here. This solver works every time we see a condition in code and when we tried using z3 for this instead, it degraded performance so drastically that the analyzer is not useable (e.g 10min --> 25h).

I went through the code in RangeConstraintManager.cpp and comprehended
about the constraint manager overall. canReasonAbout() (line: 1948)
shows which all operations are left being implemented (namely, Mul, Div,
Shl, Shr, and also the Xor operator). I have added these into the
proposal.

However, we have found another application for z3 - refute produced warnings. When we are about to report a new warning, we can check that constraints are sat/unsat and discard the warning in the latter case. This happens way less frequently and good in terms of performance. Alas, the majority of users have static analyzer build without z3, and some false positives warnings sneak in.

So, there are two main directions that I see:
Figure out cases when z3 refutation works better than the built-in fast solver.
Run the analyzer in both modes and analyze the difference. Usually it’s under 10 warnings, so it won’t be very tedious.

I have added this into the proposal too.

Add reasoning about range-based binary operations.
If we know ranges for symbols or symbolic expressions x and y, we can often reason about possible ranges for x OP y, where OP is some binary operator. At the moment, we have support for &, |, and % (it’s a bit weird set of operators, but it was driven by reported false positives).

Going through D80117, D79434, and D79336, one of the first things was to
bisect the condition based on the sign of symbols. Followed by
appropriate handling of upper/lower bounds for the expression.

Regarding the TODO #1 (L:931) and TODO #2 (L:936) in RangeConstraintManager.cpp,
should I put these into the proposal too? (this to be done for other binary
operations as well apart from {rem, and, or} operations)

These todos are related to lazy implemetation for VisitBinaryOperator and
caching mechanism for analysis of nested expressions.

It’s a logical continuation of my work:
D86465, D82445, D83286, D82381, D80117, D79434, and D79336

I went through these patches as well in order to understand the workflow
and things which have been done so far.

The description of project also mentions about a unit-test framework.
Can I get more details regarding this?

I have drafted the initial version of my proposal here.
(https://docs.google.com/document/d/1ts_jxi1eAPGgeSzy5cs1dwZDcrmCHao8skZT8yWtOPI/edit?usp=sharing)

Thank you

I went through the code in RangeConstraintManager.cpp and comprehended
about the constraint manager overall. canReasonAbout() (line: 1948)
shows which all operations are left being implemented (namely, Mul, Div,
Shl, Shr, and also the Xor operator). I have added these into the
proposal.

There is also an elephant in the room, - Add and Sub.

However, we have found another application for z3 - refute produced warnings. When we are about to report a new warning, we can check that constraints are sat/unsat and discard the warning in the latter case. This happens way less frequently and good in terms of performance. Alas, the majority of users have static analyzer build without z3, and some false positives warnings sneak in.

So, there are two main directions that I see:
Figure out cases when z3 refutation works better than the built-in fast solver.
Run the analyzer in both modes and analyze the difference. Usually it’s under 10 warnings, so it won’t be very tedious.

I have added this into the proposal too.

Add reasoning about range-based binary operations.
If we know ranges for symbols or symbolic expressions x and y, we can often reason about possible ranges for x OP y, where OP is some binary operator. At the moment, we have support for &, |, and % (it’s a bit weird set of operators, but it was driven by reported false positives).

Going through D80117, D79434, and D79336, one of the first things was to
bisect the condition based on the sign of symbols. Followed by
appropriate handling of upper/lower bounds for the expression.

Correct, we should always care about the bit vector nature of integers.

Regarding the TODO #1 (L:931) and TODO #2 (L:936) in RangeConstraintManager.cpp,
should I put these into the proposal too? (this to be done for other binary
operations as well apart from {rem, and, or} operations)

It didn’t seem to cause any performance problems so far, and I’d say this is not a priority right now.

These todos are related to lazy implemetation for VisitBinaryOperator and
caching mechanism for analysis of nested expressions.

It’s a logical continuation of my work:
D86465, D82445, D83286, D82381, D80117, D79434, and D79336

I went through these patches as well in order to understand the workflow
and things which have been done so far.

The description of project also mentions about a unit-test framework.
Can I get more details regarding this?

As you looked through those patches, you probably noticed that we usually add more checks into constant-folding.c test.
It is pretty good, but it would be much better to have a dedicated section of unit-tests especially because the solver is pretty isolated and doesn’t really depend on AST (only on APSInt’s actually, but this is easy to overcome, you can see how it is done in RangeSetTest.cpp).

I guess my idea of perfect unit tests for the solver would be to have a special class for test symbols that we can construct like this TestSymbol x = {{0, 10}, {20, 40}} with overloaded operators that call solver, so we can do stuff like this: TestSymbol z = x + y; /* check for expected ranges in z */
This way the actual tests are very straightforward, easy to add new ones, easy to read and maintain. And all the nitty gritty should be hidden inside of the test class and the test symbol class. The main challenge here is to design this testing framework, so it is as invisible as possible.

I have drafted the initial version of my proposal here.
(https://docs.google.com/document/d/1ts_jxi1eAPGgeSzy5cs1dwZDcrmCHao8skZT8yWtOPI/edit?usp=sharing)

Awesome, I’ll take a look and respond in comments.

Thank you

Manas
CSAM Undergraduate | 2022
IIIT-Delhi, India

Cheers,
Valeriy

> I went through the code in RangeConstraintManager.cpp and comprehended
> about the constraint manager overall. canReasonAbout() (line: 1948)
> shows which all operations are left being implemented (namely, Mul, Div,
> Shl, Shr, and also the Xor operator). I have added these into the
> proposal.
There is also an elephant in the room, - Add and Sub.
>

I have added these too.

> Regarding the TODO #1 (L:931) and TODO #2 (L:936) in RangeConstraintManager.cpp,
> should I put these into the proposal too? (this to be done for other binary
> operations as well apart from {rem, and, or} operations)
It didn’t seem to cause any performance problems so far, and I’d say this is not a priority right now.

Understood. I removed them from the timeline.

> The description of project also mentions about a unit-test framework.
> Can I get more details regarding this?
As you looked through those patches, you probably noticed that we usually add more checks into constant-folding.c test.
It is pretty good, but it would be much better to have a dedicated section of unit-tests especially because the solver is pretty isolated and doesn’t really depend on AST (only on APSInt’s actually, but this is easy to overcome, you can see how it is done in RangeSetTest.cpp).

I guess my idea of perfect unit tests for the solver would be to have a special class for test symbols that we can construct like this TestSymbol x = {{0, 10}, {20, 40}} with overloaded operators that call solver, so we can do stuff like this: TestSymbol z = x + y; /* check for expected ranges in z */
This way the actual tests are very straightforward, easy to add new ones, easy to read and maintain. And all the nitty gritty should be hidden inside of the test class and the test symbol class. The main challenge here is to design this testing framework, so it is as invisible as possible.
>

I went through RangeSetTest.cpp. Now I have a clearer picture of the
framework. Basically, as you said, we need to have class abstraction
and methods wrapping around various types of checks for operators.

The segregation of test cases should be based upon the binary operation
(as in constant-folding.c).

Hi,

The solver in question is fully located in RangeConstraintManager.cpp. There we try our best to provide useful ranges for symbolic expressions AND do it real fast. The last part is probably the most important thing here. This solver works every time we see a condition in code and when we tried using z3 for this instead, it degraded performance so drastically that the analyzer is not useable (e.g 10min → 25h).

Yeah, IIRC most of the slowdown comes from the calls at branching points: we end up calling the solver twice and we never cache the results :confused: Incremental solving would probably improve the performance a bit.

Fun fact: I implemented the support for Boolector, CVC4, MathSAT, and Yices in a separate fork, and Yices does considerably better than Z3, however, still not good enough to make it usable.

Feel free to ask any questions you might have about the SMT stuff in LLVM. The bug refutation might be a good source of insight for your project, especially if you are looking at implementing these other operations.

BTW, any plans to implement anything floating-point-related?

I am hoping to look in this z3 refutation (this month) before starting the
project if my university workload allows it. :slight_smile:
I agree with you that it will surely give me an insight for the work.

We have only discussed about integers so far.

However, studying about solvers has really intrigued me. And I
am willing to be an active contributor in CSA for future too. :slight_smile: