Purpose of GenericTaintChecker

I was looking to build a static taint analyzer when I found that one is already available as GenericTaintChecker.

However, I’m unsure how to go about using it. What I’m trying to achieve is to check whether any tainted variables have been passed into sensitive functions.

My assumption is that one has to write additional code for:

  1. Adding taint to sources that are not defined in GenericTaintChecker, through “addTaint”
  2. Writing additional checks in checkPostStmt to see whether any tainted sources are passed into sensitive functions, by string-matching on the function name and checking through “isTainted” whether the arguments passed in are tainted

I’m really confused about what the idea behind GenericTaintChecker is and how it is meant to be used. Is it supposed to be used together with other checkers that we have to write ourselves?

Below are the sources that I’ve read but still do not fully understand.

What I'm trying to achieve is to check if any tainted variables has been passed into sensitive functions.

The first "Aha!" here would be to realize that taint is not a property of a variable - it is a property of the value stored in it, and the analyzer's core engine allows you to easily work with values directly, without spending any effort to compute these values.

The analyzer denotes values which are not known during static analysis (such as values coming from user input) with *symbols* and performs algebraic operations on symbols. During program execution (or, equivalently, during analysis, a.k.a. "symbolic execution"), those symbols are passed around from one variable to another through assignments and the like: for instance, after the declaration statement "int a = b;", both variables 'a' and 'b' hold the same symbol. Results of algebraic operations on tainted symbols are also considered to be tainted, and so are symbols read through tainted pointers.
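To make the "taint lives on values" point concrete, here is a toy C++ model. It is only a sketch: ToyExpr, ToyState, toyIsTainted and the other names are made up for illustration and are not the analyzer's real classes or API.

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Toy model only: these are NOT the analyzer's real classes.
struct ToyExpr {
  enum Kind { Symbol, Constant, BinaryOp };
  Kind kind = Constant;
  int symbolId = 0;                   // meaningful when kind == Symbol
  std::shared_ptr<ToyExpr> lhs, rhs;  // meaningful when kind == BinaryOp
};
using ToyExprPtr = std::shared_ptr<ToyExpr>;

ToyExprPtr makeSymbol(int id) {
  auto e = std::make_shared<ToyExpr>();
  e->kind = ToyExpr::Symbol;
  e->symbolId = id;
  return e;
}

ToyExprPtr makeConstant() { return std::make_shared<ToyExpr>(); }

ToyExprPtr makeBinary(ToyExprPtr lhs, ToyExprPtr rhs) {
  auto e = std::make_shared<ToyExpr>();
  e->kind = ToyExpr::BinaryOp;
  e->lhs = std::move(lhs);
  e->rhs = std::move(rhs);
  return e;
}

struct ToyState {
  std::map<std::string, ToyExprPtr> bindings;  // variable name -> value it holds
  std::set<int> taintedSymbols;                // taint is attached to symbols
};

// An expression counts as tainted if it mentions any tainted symbol.
bool toyIsTainted(const ToyState &s, const ToyExprPtr &e) {
  if (!e) return false;
  if (e->kind == ToyExpr::Symbol) return s.taintedSymbols.count(e->symbolId) > 0;
  if (e->kind == ToyExpr::BinaryOp)
    return toyIsTainted(s, e->lhs) || toyIsTainted(s, e->rhs);
  return false;  // constants are never tainted
}

// Models: int b = <user input>; int a = b; int c = a + 1; int d = 7;
ToyState modelSnippet() {
  ToyState s;
  s.bindings["b"] = makeSymbol(1);
  s.taintedSymbols.insert(1);         // the value is tainted, not the variable 'b'
  s.bindings["a"] = s.bindings["b"];  // 'a' now holds the very same symbol
  s.bindings["c"] = makeBinary(s.bindings["a"], makeConstant());
  s.bindings["d"] = makeConstant();
  return s;
}
```

In this model nobody ever marks 'a' or 'c' as tainted; asking about the values they hold is enough, because those values still mention symbol 1.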

GenericTaintChecker, a.k.a. alpha.security.taint.TaintPropagation as it's called in Checkers.td, subscribes to certain function call events, such as calls to getc(). The core denotes the return values of such calls as symbols (for scanf() it is instead the values written through the pointers passed as arguments). GenericTaintChecker takes these symbols and marks them as tainted.

Then the analyzer core models how these symbols move around during execution. No checker is responsible for that - it's done automagically. Most of the time the core doesn't care whether these symbols are tainted or not - it simply models operations on them. It makes no additional effort to mark results of algebraic operations on tainted values as tainted - the taint of an algebraic symbolic expression can be computed by simply looking at the expression and checking whether it references any tainted symbols. The same goes for symbols loaded from tainted pointers - *the hierarchy of symbols is designed to remember each symbol's origins out of the box*, so it's easy to see whether a composite symbol comes from a tainted source.
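The "remembers its origins" part can be pictured with another toy C++ sketch. ToySymbol, conjure, loadThrough and hasTaintedOrigin are hypothetical names, and the single "origin" link is a big simplification of the real symbol hierarchy; it only shows that a value loaded through a tainted pointer can be recognized without anyone tagging it.

```cpp
#include <memory>
#include <set>
#include <utility>

// Hypothetical toy type, not the analyzer's symbol hierarchy.
struct ToySymbol {
  int id;
  // Symbol of the pointer this value was loaded through, or null
  // if the value was not loaded through any pointer.
  std::shared_ptr<ToySymbol> origin;
};
using ToySymPtr = std::shared_ptr<ToySymbol>;

// A fresh symbol with no recorded origin (e.g. a function's return value).
ToySymPtr conjure(int id) {
  return std::make_shared<ToySymbol>(ToySymbol{id, nullptr});
}

// A symbol for a value read through 'pointer'; it keeps a link to it.
ToySymPtr loadThrough(int id, ToySymPtr pointer) {
  return std::make_shared<ToySymbol>(ToySymbol{id, std::move(pointer)});
}

// Walk the origin chain: the symbol is tainted if it, or anything it
// was derived from, is in the tainted set.
bool hasTaintedOrigin(const std::set<int> &tainted, ToySymPtr s) {
  for (; s; s = s->origin)
    if (tainted.count(s->id) > 0) return true;
  return false;
}
```

For example, if only the pointer symbol with id 1 is in the tainted set, then loadThrough(2, pointer) is reported as tainted too, while an unrelated conjure(4) is not.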

Whenever the core encounters a call to a function it doesn't model (say, because the body isn't available), the return value is not tainted even if the arguments of the call are tainted, because otherwise we'd get a lot of false positives. So when return values of functions need to be marked as tainted depending on the taintedness of the arguments, GenericTaintChecker is responsible for modeling that. This is the "taint propagation" thing. For instance, taint propagates through strcat(), which in theory allows us to catch SQL injections.
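A propagation rule can be thought of as a small table entry: "if any of these arguments is tainted, taint those outputs". The sketch below is a toy in C++; ToyPropagation, toyRules and modelCall are invented names, and the single strcat entry is an illustrative guess at what such a rule looks like, not a copy of the checker's actual table.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical rule shape, for illustration only.
struct ToyPropagation {
  std::vector<std::size_t> sourceArgs;  // if any of these arguments is tainted...
  std::vector<std::size_t> destArgs;    // ...these arguments become tainted
  bool taintsReturn;                    // ...and so does the return value
};

const std::map<std::string, ToyPropagation> &toyRules() {
  static const std::map<std::string, ToyPropagation> rules = {
      // strcat(dest, src): taint in either argument ends up in dest,
      // which is also what the function returns.
      {"strcat", ToyPropagation{{0, 1}, {0}, true}},
  };
  return rules;
}

// Returns whether the call's return value is tainted; updates argTaint in place.
bool modelCall(const std::string &callee, std::vector<bool> &argTaint) {
  auto it = toyRules().find(callee);
  if (it == toyRules().end())
    return false;  // unmodeled call: the result is not tainted
  const ToyPropagation &rule = it->second;
  bool anyTainted = false;
  for (std::size_t i : rule.sourceArgs)
    if (i < argTaint.size() && argTaint[i]) anyTainted = true;
  if (!anyTainted) return false;
  for (std::size_t i : rule.destArgs)
    if (i < argTaint.size()) argTaint[i] = true;
  return rule.taintsReturn;
}
```

Note how a call to a function with no rule returns an untainted result even when its arguments are tainted, which mirrors the conservative default described above.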

Finally, tainted symbols may reach sensitive functions. For example, a tainted input string in a call to system() allows execution of arbitrary code. This is the *third* kind of function GenericTaintChecker subscribes to - upon noticing tainted arguments passed to such functions, it issues warnings.
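Putting the three kinds of functions together, here is a toy C++ pass over a list of calls. It is a sketch under loud assumptions: ToyCall and runToyTaintPass are made up, values are identified by plain strings rather than symbols, and the three name sets are just the examples mentioned in this thread (gets, strcat, system), not the checker's real lists.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one call: callee name, name of the value it
// produces, and names of the values passed as arguments.
struct ToyCall {
  std::string callee;
  std::string result;
  std::vector<std::string> args;
};

std::vector<std::string> runToyTaintPass(const std::vector<ToyCall> &trace) {
  const std::set<std::string> sources = {"gets"};        // kind 1: taint sources
  const std::set<std::string> propagators = {"strcat"};  // kind 2: propagation
  const std::set<std::string> sinks = {"system"};        // kind 3: sensitive sinks

  std::set<std::string> taintedValues;
  std::vector<std::string> warnings;
  for (const ToyCall &call : trace) {
    bool anyArgTainted = false;
    for (const std::string &arg : call.args)
      if (taintedValues.count(arg) > 0) anyArgTainted = true;

    // Sinks are checked before the call "executes".
    if (sinks.count(call.callee) > 0 && anyArgTainted)
      warnings.push_back("tainted data passed to " + call.callee);

    // Sources always taint their result; propagators only if fed taint.
    if (sources.count(call.callee) > 0 ||
        (propagators.count(call.callee) > 0 && anyArgTainted))
      taintedValues.insert(call.result);
  }
  return warnings;
}
```

A trace of gets, then strcat on its result, then system on the concatenation yields one warning; a trace where gets is called but system only receives an unrelated value yields none.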

If you want to extend this functionality by adding your own:
(1) Taint sources,
(2) Taint propagation rules,
(3) Warnings for tainted value usage,
Then you can either extend the relevant section of GenericTaintChecker, or write your own checker - it doesn't really matter, because taint information is visible to all checkers. It might be more comfortable to extend GenericTaintChecker because it allows some code re-use. If you write your own taint checker, you can either use it together with GenericTaintChecker (its work on taint sources and taint propagation may be of use) or disable GenericTaintChecker completely (say, if you don't want to see its warnings).

The analyzer analyzes every statement in the CFG, first calling checkPreStmt for it, then "executing" the statement, then calling checkPostStmt. The "executing" phase may include other callbacks, such as checkBind or checkPreCall/checkPostCall. CFG terminators, such as `if()` or `for()` or `?:` or `&&`, are covered not by checkPreStmt/checkPostStmt but by checkBranchCondition.
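The ordering described here can be written down as a tiny C++ function. This is a simplified sketch, not engine code: ToyStmt and callbacksFor are hypothetical, checkBind is left out, and "execute" is just a placeholder for the modeling step.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one CFG element.
struct ToyStmt {
  std::string text;
  bool isTerminator;  // if(), for(), ?:, && and friends
  bool isCall;        // a function call expression
};

// Lists, in order, the callbacks a checker could observe for this element.
std::vector<std::string> callbacksFor(const ToyStmt &s) {
  if (s.isTerminator)
    return {"checkBranchCondition"};  // no pre/post statement callbacks
  std::vector<std::string> order = {"checkPreStmt"};
  if (s.isCall) order.push_back("checkPreCall");
  order.push_back("execute");  // the engine models the statement here
  if (s.isCall) order.push_back("checkPostCall");
  order.push_back("checkPostStmt");
  return order;
}
```

For a call such as system(cmd) this lists checkPreStmt first and checkPostStmt last, with the call callbacks nested inside; for a terminator such as an if condition it lists only checkBranchCondition.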

If you want to observe how taint flows, you can enable the debug.TaintTest checker - it warns on every expression whose value is tainted.

If it's still not obvious how GenericTaintChecker operates, you can have a look at its tests in test/Analysis/taint-generic.c and try to figure out how they differ from your case. For example, gets() by itself isn't a bug, so there is no warning for it; but system(gets()) is already suspicious.

Yeah, taint information is the same for all checkers: if GenericTaintChecker is enabled, the symbols it marks as tainted are seen as tainted by every other checker.

In any case, I encourage you to share more details on what you're doing, because I have no way of guessing what subtle misunderstanding may be blocking you. E.g., you may be confusing the value of a symbolic pointer with the symbolic value behind that pointer, which is very easy to mess up when working with symbols before you build up some intuition/experience. There may be a lot of problems like that when you just start. Since the previous thread, I finally published a guide for beginners (https://github.com/haoNoQ/clang-analyzer-guide/releases/download/v0.1/clang-analyzer-guide-v0.1.pdf) - though it doesn't contain much on taint (we're already past it, worth improving I guess), it might save some time on understanding the ideas behind the analyzer.

Another problem you might have encountered - not seeing taint on a CallExpr right away - arises because there's a race between your checker and GenericTaintChecker, both of which subscribe to check::PostStmt<CallExpr>. The order in which different checkers are called on the same callback is currently undefined (though we realize that it's a good idea to create dependencies between checkers). Most of the time it doesn't matter though, because one usually catches tainted values in checkPreStmt. But I have no way of guessing if that's what you're doing.
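The ordering problem is easy to reproduce in a toy C++ model. Everything here is hypothetical (ToyPostCallState, toyTainter, toyObserver, runPostCall); it only imitates two checkers sharing one post-call callback slot, with the call order chosen by the caller.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical shared state visible to both toy checkers.
struct ToyPostCallState {
  std::set<std::string> tainted;  // values currently marked as tainted
  bool observerSawTaint = false;  // what the second checker concluded
};
using ToyPostCallback =
    std::function<void(ToyPostCallState &, const std::string &)>;

// Plays the role of the taint checker: marks the call's return value.
void toyTainter(ToyPostCallState &s, const std::string &ret) {
  s.tainted.insert(ret);
}

// Plays the role of your checker: inspects the return value in the same callback.
void toyObserver(ToyPostCallState &s, const std::string &ret) {
  s.observerSawTaint = s.tainted.count(ret) > 0;
}

// Runs the post-call callbacks in the given order for one call.
ToyPostCallState runPostCall(const std::vector<ToyPostCallback> &order,
                             const std::string &ret) {
  ToyPostCallState state;
  for (const ToyPostCallback &callback : order) callback(state, ret);
  return state;
}
```

If the tainter runs first, the observer sees the taint; if the observer runs first, it does not. In both orders the value is tainted once the whole callback round is over, which is why a check made later, before the statement that consumes the value, is not affected by the race.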