Adding taint sources to GenericTaintChecker

The taint sources already included are user input functions like getchar,scanf,etc. My requirement is to mark a few function return values as taint sources at the source level. For ex, in this case readval.

int readval()
{
return 10;
}
int a,b;
a = readval() // warning : tainted
b = a+1 //warning : tainted

Hence , I added this line .Case(“readval”, &GenericTaintChecker::postRetTaint)to addSourcesPost function in GenericTaintChecker.cpp and built clang again . But, variable ‘a’ isn’t getting tainted when I run scan-build -enable-checker alpha.security.taint.TaintPropagation,debug.TaintTest clang readval.c -o rv . Is there anything else I need to take care?

Regards

int readval()

> {
> return 10;
> }
>
> int a,b;
> a = readval() // warning : tainted
> b = a+1 //warning : tainted

In your example, readval() returns 10. Our analysis is inter-procedural, so it knows such things.

10 is a concrete value. A concrete value cannot be tainted - an attacker cannot forge 10 to become 20, or something like that. It's just "the" 10, and all 10's are the same. Something is tainted if it's a user input or is anyhow known to be able to take completely arbitrary values; 10 is not an input from the user, and it's quite under our control. So the analyzer knows for sure that readval() returns a value that cannot be tainted, and the message from the checker gets ignored - this is expressed by the fact that the analyzer was unable to obtain a symbol from the value provided by the checker, because the value is concrete.

In fact, only *symbols* may be "truly" tainted. To be exact, addTaint() works with SymExpr's (SymbolRef's) and, additionally, SymbolicRegion's (which are essentially regions pointed to by SymExpr pointers). isTainted() works on SymExpr's, SymbolicRegion's and their sub-regions, and additionally on SVal's of class nonloc::SymbolVal, loc::MemRegionVal, nonloc::LocAsInteger whenever they contain a SymExpr or a SymbolicRegion or its sub-region.

If i replace your definition of readval() with an opaque forward declaration, eg:

   int readval();
   void foo() {
     int a = readval() // warning : tainted
   }

then everything works as expected.

On the other hand, if the definition of readval() is truly available in your translation unit, then you don't need to add *it* to GenericTaintChecker - instead, add whatever readval() calls to obtain the user input, and the analyzer would model readval() itself and pass the symbol down to the caller.

Thanks for the explanation . The use case I had in mind was that readval is a function loaded from a dynamic library and it returns some value which is critical ( may or may not depend on external input). But since the programmer knows it is critical in some sense , he might want to track the flow of the return value through taint propagation and dump all instructions which access variables/memory locations that depend on the initial critical sources. Is there anyway by which I can guarantee those initial return values to be tainted?

Regards,
Ashwin

There shouldn't be a problem unless these values are compile-time constants.

There might be a bit tricky (though not very hard) to determine if the correct function is called, in case it's actually loaded from a dynamic library and passed around as a pointer, but that's a different story. If the loading process is hidden in another translation unit, and the end user receives only a forward declaration of readval(), that shouldn't be a problem.

Okay cool. One thing I have observed that in assignment statements is that, if a warning such as ‘Division by a tainted value, possibly zero’ occurs , then the taint is not propagated to the left hand side. But , there seems to be some discrepancy when I use float instead of int, the taint doesn’t propagate at all . Does the type matter?

int readl()
{
float a;
scanf("%f",&a);
return a;
}

int main()
{
float a,b,c,d,e;
a = readl();
printf("%f",a);
b = a*2.0;
c = (b+1.0)*100.0;
d = (c-1.0)/5.0;
return 0;
}

The taint doesn’t propagate in the above case( I don’t think even ‘a’ gets tainted) but if i change everything to int, then it works . Is it dependent on the variable type?

Regards,
Ashwin

Umm yeah, and then we run into a bit of an issue - the analyzer does not use symbols for floats yet. The problem was that constraints on floats are not yet supported, otherwise there shouldn't be a problem to replace UnknownVal's with float-type symbols. This is item #2 in http://clang-analyzer.llvm.org/open_projects.html . Maybe we should go ahead and produce symbols anyway, and ignore them in the constraint manager, solely for the purposes of taint analysis, before we have anything better? Not sure right now what kinds of false positives we may unleash this way.

Oh, is someone currently working on this? What would be the basic workflow if anyone has to start working on adding support for float , atleast taint analysis to start with?

Oh, is someone currently working on this? What would be the basic

workflow
> if anyone has to start working on adding support for float , atleast taint
> analysis to start with?

We had a quick look in our team, but we don't have a ready-made patch.

First, you'd need a sub-class of SVal to store concrete floating-point values, i.e. nonloc::ConcreteFloat that is similar to nonloc::ConcreteInt but wraps around APFloat. You'd also need SymFloatExpr and FloatSymExpr symbolic expression classes, similarly to SymIntExpr and IntSymExpr but holding APFloat as their RHS or LHS respectively. These symbolic expressions need to be considered in a few places, eg. they need to inherit taint from their symbolic operand.

And after that probably I'd start with making ExprEngine and SValBuilder produce less UnknownVal's for float-type stuff - instead, produce a relevant atomic symbol or symbolic expression - and probably add a few stubs into RangeConstraintManager to avoid simplifying these symbols. If this doesn't cause many new false positives, then you should be already ok with it, as it should be enough to let taint analysis work.

The final step would be to let RangeConstraintManager actually reason about floats, i.e. transform "(conj_$1<float> >= 0.0): [1, 255]" into "conj_$1<float>: [0.0, +inf]" (or whatever the current floating-point semantics thinks about infinity). This would finish the open project.