Having reread the docs, I have several questions.
- During the presentation you say that the order in which checker callbacks happen is not guaranteed by the analyzer as it explores the CFG. As far as I know, for example, PreCall will always be called before a PostCall event, and PreStmt always before a PostStmt. I don’t really understand what you were referring to.
It’s more that the analyzer engine is most likely not exploring one path at a time. Instead, it could easily be performing a breadth-first search of the state space. Even if it did do a plain DFS, it would eventually have to jump back to try another path. Checkers can’t tell the difference between any of these, so it’s not safe to store any information in the checkers (with the one exception of lazily-initialized data that applies for the whole analysis).
You’re right that along a single path you can guarantee that the PreCall check will always have happened by the time you reach a PostCall check, and that logically speaking that implies that the PreCall is processed “first”. But a lot more could have happened in the middle before you get to the PostCall. Or you may not get to the PostCall at all.
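To make the hazard concrete, here is a minimal sketch of a breadth-first exploration. It is a toy model and not analyzer code: Node, ToyChecker, Observation, and exploreBreadthFirst are hypothetical names, and the two paths and their arguments (10 and 20) are made up. It only shows that a value cached in a checker member can belong to a different path by the time PostCall runs, while a value carried with the path's own state cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

enum class Event { PreCall, PostCall };

// Toy stand-in for an exploration node: the next event on some path plus
// the state carried along that path (the analogue of ProgramState).
struct Node {
  int path;          // which execution path this node belongs to
  std::size_t step;  // index of the next event on that path
  int argInState;    // argument recorded at PreCall, stored per path
};

// Toy checker that (incorrectly) also caches the argument in a member.
struct ToyChecker {
  int argInMember = -1;  // shared by every path: this is the bug
};

struct Observation {
  int fromMember;  // what PostCall on path 0 read from the checker member
  int fromState;   // what PostCall on path 0 read from the per-path state
};

// Explore two paths breadth-first. Path 0 calls f(10) and returns; path 1
// calls f(20) and never reaches PostCall.
Observation exploreBreadthFirst() {
  const std::vector<std::vector<Event>> events = {
      {Event::PreCall, Event::PostCall},
      {Event::PreCall},
  };
  const int args[] = {10, 20};

  ToyChecker checker;
  Observation seen{-1, -1};
  std::deque<Node> worklist = {{0, 0, -1}, {1, 0, -1}};

  while (!worklist.empty()) {
    Node n = worklist.front();
    worklist.pop_front();
    if (n.step >= events[n.path].size())
      continue;  // this path has ended

    if (events[n.path][n.step] == Event::PreCall) {
      checker.argInMember = args[n.path];  // clobbered by the next path's PreCall
      n.argInState = args[n.path];         // travels only with this path
    } else {
      seen = {checker.argInMember, n.argInState};
    }
    worklist.push_back({n.path, n.step + 1, n.argInState});
  }
  return seen;
}

// Usage: the member holds path 1's argument, the per-path state holds path 0's.
void demoExploration() {
  Observation o = exploreBreadthFirst();
  assert(o.fromMember == 20 && o.fromState == 10);
}
```

Path 1's PreCall runs between path 0's PreCall and PostCall, so the member is overwritten in the middle. Path 1 also never produces a PostCall at all.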
- Should a checker be interested only in the parameter being passed during a function call, I guess it wouldn’t make any difference whether checking the parameter in a PreCall event or in a PostCall event, would it? However, in this case, is it better to only register for the PreCall callback/event because of performance reasons?
It depends what your interest is. If you’re looking at the contents of memory referred to by a pointer argument, it makes a big difference whether you want to do such a thing pre-call or post-call, because it may have changed. If you’re establishing some kind of preconditions or postconditions for the call, it makes sense to do so as a pre-call or post-call check, respectively. Remember that there may be other pre-call checkers that run after yours, or other post-call checkers that run before yours.
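As a small illustration of the first point, the following sketch is an ordinary program rather than a checker. The callee scrub, the buffer contents, and the Snapshots type are all hypothetical. The pointer argument is the same before and after the call, but the memory it refers to is not, so a check that inspects the pointee sees different contents depending on where it runs.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical callee that overwrites the buffer it is given.
void scrub(char *buf) { std::memset(buf, '*', std::strlen(buf)); }

struct Snapshots {
  std::string preCall;   // pointee contents just before the call
  std::string postCall;  // pointee contents just after the call
};

Snapshots observeAroundCall() {
  char secret[] = "hunter2";
  Snapshots s;
  s.preCall = secret;   // what a pre-call check would find behind the pointer
  scrub(secret);
  s.postCall = secret;  // what a post-call check would find
  return s;
}

// Usage: same pointer argument, different contents on each side of the call.
void demoSnapshots() {
  Snapshots s = observeAroundCall();
  assert(s.preCall == "hunter2");
  assert(s.postCall == "*******");
}
```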
- When tracking the use of values (variables) between callbacks, if needed, checkers must use the ProgramState as a means of preserving custom information. This is clear. It’s best to refer to those values by the underlying symbol (symbolic representation) created by the analyzer. In my case, I want to track the use of pointers to char (variables of type char *). In this case, the first argument to the checkBind callback will be a MemRegionVal. Reading the documentation, the counterpart of a symbol (SymbolRef in terms of the API) with regard to MemRegions is a SymbolicRegion.
Let’s consider the following:
char *s = "string literal";
char pwd[] = "password";
char *p;
p = s; (1)
p = pwd; (2)
To be able to track the use of “p”, as in (1) and (2) above, I was thinking of obtaining a symbol that represents that variable (memory region) and saving that symbol in the ProgramState in case there’s a future reference to it in the program being analyzed, similar to the idea in the sample SimpleStreamChecker. Why is “p” not a symbolic region in this case? Reading the API, I thought the best way would be to call getSymbolicBase() and then getSymbol() on the result, but the former returns NULL. So, did I misunderstand, and is MemRegion just the counterpart of a SymbolRef? If so, what would be the best way of saving a MemRegion’s symbolic representation in the ProgramState?
A symbol represents an unknown, constrainable value—you have a type, you can say “this symbol’s value is not 0”, but that’s about it.
A MemRegion represents, well, a region in memory: it can have things stored in it. If you have a location-typed symbol (a pointer or reference), then it has a corresponding SymbolicRegion representing the memory at that unknown address. Not all symbols have MemRegions, though: an integer symbol does not represent a pointer. Similarly, not all MemRegions have symbols: a VarRegion has a known address and behavior (albeit abstracted) and can be represented much more concretely than the memory corresponding to an arbitrary returned pointer. A FieldRegion isn’t a base region on its own; it’s a subregion of some struct-typed super-region.
Though the video doesn’t mention it, MemRegions are also persistent, meaning you can use them as keys in the ProgramState directly. However, you have to be careful—a cast is represented as an ElementRegion subregion with a different location type. Using a base region as a key should always be safe, though, if it makes sense for your particular use case.
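The difference between keying on an exact region and keying on its base region can be sketched with a toy model. The types below only mimic the shape of the region hierarchy and are not the Clang classes. Kind, Region, baseOf, and distinctKeys are hypothetical names, and the toy assumes that only field and element regions have a super-region, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <map>
#include <string>

// Toy model only: not the analyzer's MemRegion classes.
enum class Kind { Var, Field, Element };

struct Region {
  Kind kind;
  std::string name;
  const Region *super;  // enclosing region, or nullptr for a base region
};

// Walk up through sub-regions (fields, elements) to the enclosing base region.
const Region *baseOf(const Region *r) {
  while (r->super != nullptr)
    r = r->super;
  return r;
}

// Record one note per region and report how many distinct keys result.
std::size_t distinctKeys(bool keyOnBase) {
  const Region var{Kind::Var, "buf", nullptr};
  const Region castView{Kind::Element, "buf viewed through a cast", &var};
  const Region field{Kind::Field, "buf.len", &var};

  std::map<const Region *, int> notes;  // region pointers serve as keys
  for (const Region *r : {&var, &castView, &field})
    ++notes[keyOnBase ? baseOf(r) : r];
  return notes.size();
}

// Usage: exact regions give three separate entries, base regions give one.
void demoKeys() {
  assert(distinctKeys(false) == 3);
  assert(distinctKeys(true) == 1);
}
```

Whether collapsing everything onto the base region is the right choice depends on the checker. It fits when the fact being tracked is about the whole object rather than about one field or one cast view of it.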
In your particular example, ‘p’ represents a VarRegion whose location type is “char**” and whose value type is “char*”. The contents of ‘p’ after (1) is the value of ‘s’, that is, the address of the string literal, which will be a loc::MemRegionVal referring to a StringRegion. The contents of ‘p’ after (2) is the address of the start of ‘pwd’, which is a loc::MemRegionVal referring to a VarRegion of array type. None of the values are symbolic because the analyzer can model all of them fairly concretely.
Hope that helps,