[GSoC 2022] Draf proposal

Hi!

I’m interested in taking part in Google Summer of Code by working on Clang Static Analyzer. The project I’d work on is Implement support for C++17 structured bindings in the Clang Static Analyzer.

I started to look into how the static analyzer operates with said structured bindings now, and why I think it behaves like that. I’ve written my thoughts in a yet draft proposal which you can find here.

I appreciate any feedback!

Hi,

Very nice!

You seem to be focusing on modeling binding decls. Note that we know a lot more about binding decls at compile time than we do about regular variables or references. It might be that field name and Store binding for the decomposition are sufficient to compute ‘x’ and ‘y’ immediately at any point within its scope so we actually don’t need to store them in the abstract state at all. Can you confirm or deny that?

In the tuple-like case, does the AST have call expressions for std::get at all anywhere? What does the use site look like when you’re trying to use one of the binding variables?

Generally, don’t forget to look at the use site. You can’t find out what to keep in the abstract state until you know what you’ll need to model future queries.

Hi @NoQ!

Thanks for the feedback.

As you suggested I created a snippet, where I actually use a bound variable.

int main() 
{ 
    int arr[2];
    auto [a,b] = arr; 

    int i = a;

    return 0; 
}

In the AST a is referenced.

|-DeclStmt <line:8:5, col:14>
    | `-VarDecl <col:5, col:13> col:9 i 'int' cinit
    |   `-ImplicitCastExpr <col:13> 'int' <LValueToRValue>
    |     `-DeclRefExpr <col:13> 'int' lvalue Binding 0x56452d8d0a10 'a' 'int'

After checking on the snippet it turnes out that ExprEngine::VisitCommonDeclRefExpr is called (in my proposal I falsely stated it isn’t).

Inside the function the state object looks like this:

...
 "store": { "pointer": "0x555563d1d488", "items": [
    { "cluster": "NonParamVarRegion{D646}", "pointer": "0x555563d1d3d0", "items": [
      { "kind": "Default", "offset": 0, "value": "Unknown" }
    ]}
  ]},
...

It seems like the temporary array is in the store (?). I checked what happens if I use b in the program later, but the store looks the same. If neither a nor b is referenced, the store is empty, like arr itself is not stored at all.

Let’s assume a ponter to the first element of the temporary array is put in the store, as an array can be identified by a pointer to it’s first element, thus arr == &arr[0] == arr + 0. The problem here is that later we want to read the value of a and we don’t know the index.

So based on this I deny that the field name and the store binding is enough to get the value. An index should be associated with the field name.

In the following snippet

...
std::tuple<float, char, int> tpl(x, y, z);
auto [a, b, c] = tpl;

float i = a;
...

the use site looks like a regular assignment referencing the tuple element.

|-DeclStmt <line:12:5, col:16>
    | `-VarDecl <col:5, col:15> col:11 i 'float' cinit
    |   `-ImplicitCastExpr <col:15> 'std::tuple_element<0, std::tuple<float, char, int>>::type':'float' <LValueToRValue>
    |     `-DeclRefExpr <col:15> 'std::tuple_element<0, std::tuple<float, char, int>>::type':'float' lvalue Binding 0x55906f9e0120 'a' 'std::tuple_element<0, std::tuple<float, char, int>>::type':'float'

In the program state it seems the tuple is stored and the name of the tuple is mapped to the representation:

"program_state": {
    { "cluster": "NonParamVarRegion{D418913}", "pointer": "0x555564269930", "items": [
      { "kind": "Default", "offset": 0, "value": "lazyCompoundVal{0x555564267f08,tpl}" }
    ]}
  ]},
  "environment": { "pointer": "0x555563cfe590", "items": [
    { "lctx_id": 1, "location_context": "#0 Call", "calling": "main", "location": null, "items": [
      { "stmt_id": 432130, "pretty": "tpl", "value": "lazyCompoundVal{0x555564267f08,tpl}" }
    ]}
  ]},

Opposed to arrays, knowing about tpl and the field name makes it possible to read the specific value, as the AST contains the index of the element, which is 0 in this case.

I’d also like to have a question regarding how should I translate this cluster.

"cluster": "NonParamVarRegion{D418913}", "pointer": "0x555564269930", "items": [
      { "kind": "Default", "offset": 0, "value": "lazyCompoundVal{0x555564267f08,tpl}" }
    ]}

At first glance I assume, the cluster is just a location in memory, where the data is stored, which in this case is a tuple. But in that case why is the value Unknown ({ "kind": "Default", "offset": 0, "value": "Unknown" }) when an array is stored? Is it supposed to be the value of the first element in the array?

A cluster is, like, an entire allocation of memory. Eg., a single stack variable (stack allocation) or a return value of new (heap allocation), etc. As opposed to sub-objects like fields or array elements. You can jump from one sub-object to another within a cluster with pointer arithmetic, eg. increment the index to point to the next element, or find a field by offset. You can’t do that across clusters.

In RegionStore, which is our implementation of symbolic memory, all stored values are grouped by clusters. Within a cluster it’s sufficient to specify an offset to find the value, which is what you see as "offset": ... in the dumps.

Now, the Unknown value (clang: clang::ento::UnknownVal Class Reference) is basically an indication that we failed to model something. If you see an UnknownVal, it’s always a bug, it’s always not how it’s supposed to be. It looks like we don’t know how to write an array initializer in the array case at all, so it’s unknown. It might be a separate bug that we’ll have to address if we want this test case to work properly.

In the other example looks like we have a compound value that covers the entire cluster. That’s good, this means we can theoretically extract values of all the fields from it.

The problem here is that later we want to read the value of a and we don’t know the index.

But at least we know that it’s a and not b right? We know at compile time in which order they were originally written in the structured binding. Doesn’t that tell us a lot already?

the use site looks like a regular assignment referencing the tuple element.

So there’s no call expressions like anywhere? Can you find out when exactly is are we supposed to call get<>() in the tuple case - at the store site or at every load?

This is what I’ve been thinking about lately, but if I’m right at the moment we don’t know it. The only thing I can tell by looking at the AST is that the address in the DeclRefExpr is smaller if the symbol was declared earlier. By address I mean for example 0x55fabea94a00 in DeclRefExpr <col:13> 'int' lvalue Binding 0x55fabea94a00 'a' 'int'.

First I thought it points to an instruction or a read-only memory segment where with some trick we can find the value, but when I checked in GDB it said the address was not accessabble. What I know for sure is that it doesn’t point to a value directly, as in case of an array decomposition the distance between the address next to symbol a and b is 72, which is too large for an int.

There are call expressions during tuple contruction. Also based on the AST of a snippet like this

std::tuple<float,char,int> tpl(x,y,z);
auto [a,b,c] = tpl;

float i = a;
|-DeclStmt <line:13:5, col:16>
    | `-VarDecl <col:5, col:15> col:11 i 'float' cinit
    |   `-ImplicitCastExpr <col:15> 'std::tuple_element<0, std::tuple<float, char, int>>::type':'float' <LValueToRValue>
    |     `-DeclRefExpr <col:15> 'std::tuple_element<0, std::tuple<float, char, int>>::type':'float' lvalue Binding 0x562459023f20 'a' 'std::tuple_element<0, std::tuple<float, char, int>>::type':'float'

We already seem to know about the element as if we called get<>() already. I mean calling get explicitly looks like this:

std::tuple<float,char,int> tpl(x,y,z);
float a = std::get<0>(tpl);
|-DeclStmt <line:11:5, col:31>
    | `-VarDecl <col:5, col:30> col:11 a 'float' cinit
    |   `-ImplicitCastExpr <col:15, col:30> '__tuple_element_t<0UL, tuple<float, char, int>>':'float' <LValueToRValue>
    |     `-CallExpr <col:15, col:30> '__tuple_element_t<0UL, tuple<float, char, int>>':'float' lvalue
    |       |-ImplicitCastExpr <col:15, col:25> '__tuple_element_t<0UL, tuple<float, char, int>> &(*)(tuple<float, char, int> &) noexcept' <FunctionToPointerDecay>
    |       | `-DeclRefExpr <col:15, col:25> '__tuple_element_t<0UL, tuple<float, char, int>> &(tuple<float, char, int> &) noexcept' lvalue Function 0x55f3fe0c3bf0 'get' '__tuple_element_t<0UL, tuple<float, char, int>> &(tuple<float, char, int> &) noexcept' (FunctionTemplate 0x55f3fdfb50e0 'get')
    |       `-DeclRefExpr <col:27> 'std::tuple<float, char, int>':'std::tuple<float, char, int>' lvalue Var 0x55f3fdff7638 'tpl' 'std::tuple<float, char, int>':'std::tuple<float, char, int>'

What the call expression returns is what we reference in case of binding. __tuple_element_t<0UL, tuple<float, char, int>> and std::tuple_element<0, std::tuple<float, char, int>>::type literally the same.

template<std::size_t __i, typename _Tp>
    using __tuple_element_t = typename tuple_element<__i, _Tp>::type;

So I think get should be called when the tuple is put in the store.

I’ve been thinking about how we know in which order the BindingDecls were declared, when we evaluate a DeclRefExpr.

When ExprEngine::VisitCommonDeclRefExpr is called, the expression and the corresponding declaration is passed as parameter.

In case of the expression DeclRefExpr 0xdc4f1f0 'int' lvalue Binding 0xdc4ee80 'a' 'int' the corresponding BindingDecl is passed, which is

BindingDecl 0xdc4ee80 <array_member/use.cpp:4:11> col:11 referenced a 'int'
`-ArraySubscriptExpr 0xdc4f0c0 <col:11> 'int' lvalue
  |-ImplicitCastExpr 0xdc4f0a8 <col:11> 'int *' <ArrayToPointerDecay>
  | `-DeclRefExpr 0xdc4f068 <col:11> 'int[2]' lvalue Decomposition 0xdc4ef10 '' 'int[2]'
  `-IntegerLiteral 0xdc4f088 <col:11> 'int' 0

Knowing this, we can extract the index on the appropriate branch.

if (const auto *BD = dyn_cast<BindingDecl>(D)) {
    // FIXME: proper support for bound declarations.
    // For now, let's just prevent crashing.
    
    const auto *Expr = BD->getBinding();
    if(const auto *ArrSubScript = dyn_cast<ArraySubscriptExpr>(Expr)) {
      //Extract the index here by calling ArrSubScript->getIdx()
    }

    return;
}

I’ve also been thinking about how could BindingDecls be evaluated. I can think of 2 different approaches.

  if (const auto *BD = dyn_cast<BindingDecl>(D)) {

    const auto *BindingExpr = BD->getBinding();

    if(const auto *ASE = dyn_cast<ArraySubscriptExpr>(BindingExpr))
    {
      //Extract the required data here and process it, 
      //technically we reinvent the existing

      const Expr *Base = ASE->getBase()->IgnoreParens();
      const Expr *Idx  = ASE->getIdx()->IgnoreParens();
      QualType T = ASE->getType();

      SVal V = state->getLValue(T,
                                state->getSVal(Idx, LCtx),
                                state->getSVal(Base, LCtx));

      Bldr.generateNode(Ex, Pred, state->BindExpr(Ex, LCtx, V), nullptr,
                      ProgramPoint::PostLValueKind);
      return;
    }
    if(const auto *DRE = dyn_cast<DeclRefExpr>(BindingExpr))
    {
      //Delegate the work to the appropriate visitor

      ExplodedNodeSet tmpDst;
      VisitCommonDeclRefExpr(Ex, DRE->getFoundDecl(), Pred, tmpDst);

      Dst.insert(tmpDst);
      Pred = *tmpDst.begin();
      
      return;      
    }

    return;
  }

At the moment none of the solutions work, as in case of arrays the temporary array is Unknown in the store and in case of tuples I get strange results.

The snippet I used for testing is

int main()
{
    float x = 1;
    char y = 2;
    int z = 3;

    std::tuple<float, char, int> tpl(x, y, z);
    auto [a, b, c] = tpl;

    float i = a;
    char j = b;
    int k = c;

    return 0;
}

After evaluating float x = 1; the value of x in the store is Unknown, y and z are recorded correctly. Later I get promising results like in case of char j = b;, the store looks like
j -> conj_$0{std::tuple_element<0, class std::tuple<char, int> >::type, LC1, S483423, #1} and then the egraph is split into two and on one branch I get core.NullDereference in the DeclRefExpr of c.

It’s like the symbols are bound to the corresponding tuple elements but the value of those elements is not read properly.

While I get to this thread, just wanted to say that even though there’s definitely a lot more to discuss there, the deadline is approaching at April 19, 18:00 UTC to submit the proposal PDF to the GSoC website. Please don’t wait for our full approval, submit the proposal regardless!