Memory modeling, RegionStore Questions

Hello, I’ve been studying the RegionStore implementation of the static analyzer in an attempt to debug some false positives I’ve come across. I found the paper “Memory Model for the Static Analysis of C Programs” and Artem’s workbook (“Clang Static Analyzer – A Checker Developer’s Guide”) , and read through many past articles. These are excellent references to get started, now I’d like to enhance my comprehension by jumping in and fixing a few issues that may need to be fixed. I apologize in advance if I’m missing some obvious bridges to comprehension, but I’m still early in the process of learning this. I’d like to contribute to improving the static analyzer and learn best through debugging and implementing. Which brings me to the discussion questions and asks for a few tips on next steps. Thanks in advance for tips and suggestions for how to contribute.

Question 1:

Constructing a case to understanding how data and pointers to data are mapped, I came across some unexpected behavior. In the example below, I’d expect the UNKNOWN values to evaluate to true. Is this a reasonable expectation for the clang static analyzer?

Digging into this problem a bit more with the debugger, it seems at least a small portion of RegionStore is not fully implemented. I debugged this down to the code snippet shown in the data for Q1. ‘a’ comes back as 0x11223344 into V (line 1723), V->getAsSymbol() comes back as NULL (line 1724), V is known, so UnknownVal() is returned – matching the test case results. There’s also a comment in the code at that point – “// Other cases: give up. We are indexing into a larger object that has some value, but we don’t know how to handle that yet.”.

Seems it’s possible to construct concrete values for these cases, if I’m not missing anything – but I’m not sure how to do this. I see the return type, and the offset into the base type, so it seems as “easy” as bit twiddling to return the expected byte or short values. Is there a review that demonstrates how to accomplish this, perhaps some tips?

Question 2:

Looking through CStringChecker.cpp, I see that memset is implemented only for the case of an initialization value of 0, and the sizeof arg and buffer extents are equal (see example in data below, line 1076, also a FIXME in the comments). This seems reasonable for most use cases, but is also interesting since an implementation that allows for arbitrary initialization values could open up possibilities for other improvements (see Question 3). ProgramState.cpp has a method named bindDefaultInitialValue that seems to be perfect for this application, but asserts in RegionStore’s BindDefaultInitial → “Double initialization!” (approx line 452). Would repurposing this method (bindDefaultInitial) be a reasonable approach to solve this problem, or is there another way that would be a better approach?

Question 3:

Looking through CStringChecker.cpp, I see that memcpy does not perfectly model the memory copy from source to destination memory. Memcpy looks to be implemented as a blank invalidation of the source and destination buffers. I had first looked at this problem, and was quite puzzled since I had assumed memory copies should be modeled (at least for common cases). The suggested approach in the comment to address this issue is using LCVs. Is there an example of such an approach, or perhaps an outline to start with?

Case data for Question 1:

void clang_analyzer_eval(int);

void clang_analyzer_dump(int);

void clang_analyzer_printState();

int foo(void) {

unsigned int a = 0x11223344;

unsigned char p = (unsigned char)&a;

unsigned short *pu = (unsigned short *)&a;

clang_analyzer_printState();

clang_analyzer_eval(p[0] == 0x44); // expected-warning{{TRUE}} – OK

clang_analyzer_eval(p[1] == 0x33); // expected-warning{{TRUE}} – UNKNOWN

clang_analyzer_eval(p[2] == 0x22); // expected-warning{{TRUE}} – UNKNOWN

clang_analyzer_eval(p[3] == 0x11); // expected-warning{{TRUE}} – UNKNOWN

clang_analyzer_eval(pu[0] == 0x3344); // expected-warning{{TRUE}} – OK

clang_analyzer_eval(pu[1] == 0x1122); // expected-warning{{TRUE}} – UNKNOWN

clang_analyzer_dump(&p[0]); – &Element{a,0 S64b,unsigned char} [as 32 bit integer]

clang_analyzer_dump(&p[1]); – &Element{a,1 S64b,unsigned char} [as 32 bit integer]

clang_analyzer_dump(&pu[0]); – &Element{a,0 S64b,unsigned short} [as 32 bit integer]

clang_analyzer_dump(&pu[1]); – &Element{a,0 S64b,unsigned short} [as 32 bit integer]

}

The program state from the decls in the sample are shown here.

“program_state”: {

“store”: { “pointer”: “0x76787f8”, “items”: [

{ “cluster”: “a”, “pointer”: “0x7677fc0”, “items”: [

{ “kind”: “Direct”, “offset”: 0, “value”: “287454020 U32b” }

]},

{ “cluster”: “p”, “pointer”: “0x7678728”, “items”: [

{ “kind”: “Direct”, “offset”: 0, “value”: “&Element{a,0 S64b,unsigned char}” }

]},

{ “cluster”: “pu”, “pointer”: “0x767d9a8”, “items”: [

{ “kind”: “Direct”, “offset”: 0, “value”: “&Element{a,0 S64b,unsigned short}” }

]}

]},

“environment”: { “pointer”: “0x7607d50”, “items”: [

{ “lctx_id”: 1, “location_context”: “#0 Call”, “calling”: “foo”, “location”: null, “items”: [

{ “stmt_id”: 805, “pretty”: “clang_analyzer_printState”, “value”: “&code{clang_analyzer_printState}” }

]}

]},

“constraints”: null,

“dynamic_types”: null,

“constructing_objects”: null,

“checker_messages”: null

}“program_state”: {

“store”: null,

“environment”: { “pointer”: “0x7607d50”, “items”: [

{ “lctx_id”: 1, “location_context”: “#0 Call”, “calling”: “foo”, “location”: null, “items”: [

{ “stmt_id”: 1201, “pretty”: “clang_analyzer_printState”, “value”: “&code{clang_analyzer_printState}” }

]}

]},

“constraints”: null,

“dynamic_types”: null,

“constructing_objects”: null,

“checker_messages”: null

Code snippet from RegionStore.cpp, Function getBindingForElement, line #s approximate.

1716 if (const TypedValueRegion *baseR =

1717 dyn_cast_or_null(O.getRegion())) {

1718 QualType baseT = baseR->getValueType();

1719 if (baseT->isScalarType()) {

1720 QualType elemT = R->getElementType();

1721 if (elemT->isScalarType()) {

1722 if (Ctx.getTypeSizeInChars(baseT) >= Ctx.getTypeSizeInChars(elemT)) {

1723 if (const Optional &V = B.getDirectBinding(superR)) {

(gdb) p V->dump()

287454020 U32b$2 = void

(gdb) p R->dump()

Element{a,1 S64b,unsigned char}$3 = void

1724 if (SymbolRef parentSym = V->getAsSymbol())

1725 return svalBuilder.getDerivedRegionValueSymbolVal(parentSym, R);

1726

1727 if (V->isUnknownOrUndef())

1728 return *V;

1729 // Other cases: give up. We are indexing into a larger object

1730 // that has some value, but we don’t know how to handle that yet.

1731 return UnknownVal();

1732 }

1733 }

1734 }

Question 2 – CStringChecker.cpp, method memsetAux

1071 if (StateWholeReg && !StateNotWholeReg && StateNullChar &&

1072 !StateNonNullChar) {

1073 // If the ‘memset()’ acts on the whole region of destination buffer and

1074 // the value of the second argument of ‘memset()’ is zero, bind the second

1075 // argument’s value to the destination buffer with ‘default binding’.

1076 // FIXME: Since there is no perfect way to bind the non-zero character, we

1077 // can only deal with zero value here. In the future, we need to deal with

1078 // the binding of non-zero value in the case of whole region.

1079 State = State->bindDefaultZero(svalBuilder.makeLoc(BR),

1080 C.getLocationContext());

Question 3 – CStringChecker.cpp, method evalCopyCommon

1201 // Invalidate the destination (regular invalidation without pointer-escaping

1202 // the address of the top-level region).

1203 // FIXME: Even if we can’t perfectly model the copy, we should see if we

1204 // can use LazyCompoundVals to copy the source values into the destination.

1205 // This would probably remove any existing bindings past the end of the

1206 // copied region, but that’s still an improvement over blank invalidation.

1207 state =

1208 InvalidateBuffer(C, state, Dest.Expression, C.getSVal(Dest.Expression),

1209 /IsSourceBuffer/ false, Size.Expression);

1210

1211 // Invalidate the source (const-invalidation without const-pointer-escaping

1212 // the address of the top-level region).

1213 state = InvalidateBuffer(C, state, Source.Expression,

1214 C.getSVal(Source.Expression),

1215 /IsSourceBuffer/ true, nullptr);