Clang GenericTaintChecker limitations

Hi All,

I am looking for an open source static taint analysis tool that I can run on some applications to reason about security properties – just to check if a tainted value can flow to some function parameters etc. The programs I want to try this on are around 10-20K lines of C code. I was thinking of using Clang’s GenericTaintChecker (and just modifying the taint sources) for this purpose. I’d like to know if there are any limitations to this analysis that I should be aware of.

I know that the interprocedural analysis doesn’t work across translation units, but I’v managed to merge my source files using the cilly tool. I was mainly wondering about the precision of the taint analysis (what sort of pointer/alias analysis the IPA uses etc). If you could point me to any documentation that discusses the memory model, that would be great.

Is the clang taint checker considered the state-of-the-art in open-source taint checking tools or is there something that is considered better (more precise)?

Thanks,
Divya Muthukumaran
Research Associate
Department of Computing
Imperial College London

Hello,

The taint analysis we have here is not perfect, but it's pretty sane.

The analyzer assigns symbols to memory regions to represent their values at a given moment of time, passes symbols around through assignments. Then, some symbols carry taint, and symbolic expressions composed with them automatically inherit the taint.

The GenericTaintChecker performs propagation of taint through functions. For example, if you put pointers to tainted values into strcat(), then the symbol that represents the value behind the returned pointer would also be tainted, but unlike propagation from atomic symbols to expressions, this is not automagical - a checker needs to do the work, to support any specific API.

The memory model currently assumes that there is no aliasing between unknown pointers, which is another limitation. However, inter-procedural analysis works through inlining: when the function is inlined, any aliasing between its actual arguments is correctly taken into account (eg. the call of `f(&x, &x)' is modeled correctly, even though f assumes that its arguments do not alias when analyzed separately).

One of the limitation that might bite you is lack of support for floating-point values - the analyzer doesn't yet symbolicate them, so they cannot be tainted.

One thing you'd probably need is to understand how structures are modeled - eg. there's a symbol for the structure or array and the symbol for its field or element, and there are multiple methods used for representing this relationship, depending on circumstances.

I'm not aware of any other powerful open-source static analysis tools for C/C++, but you might have a look at KLEE, which is not exactly static, but also implements symbolic execution.

You may want to check an earlier discussion:
http://lists.llvm.org/pipermail/cfe-dev/2016-April/048250.html
http://lists.llvm.org/pipermail/cfe-dev/2016-April/048243.html
http://lists.llvm.org/pipermail/cfe-dev/2016-April/048363.html

Hi Artem,

Thanks for your detailed reply. That was really helpful and got me started. I also read the paper http://lcs.ios.ac.cn/~xuzb/canalyze/memmodel.pdf which gave me a good overview of how the region store works.

I do have a follow up question regarding propagation of taint through globals. I was trying to apply the Clang taint analysis to a toy key-value store program. The interface consists of item * item_alloc(int key, int value), item * item_get (int key), item_put (item*) and item_list(void) and the key-value pairs are stored as a hashtable (item_table) with global scope. A function_dispatcher() reads in user input on loop and dispatches between those functions until the user enters ‘quit’.

I was expecting that if I taint the item being added to the global item_table when item_put() is invoked, then when item_get() is used to retrieve any item, the resulting item gets the taint. But this doesn’t seem to happen.
It works correctly if I explicitly call item_alloc(), *taint_add(), item_put(), item_get() in that order from my main() without using the dispatcher.

Does this mean that the global region pertaining to item_table gets invalidated between user requests. Is there any way I can change this behavior? I guess my alternative is to write a driver with all possible interleavings of my interface functions, correct?

Best,
Divya

Hi!

Note that the analyzer do not reason about global variables right now.

And also not that there are no guarantees about the coverage. Therr might be code that is not covered by the analysis at all. So I think, as is, the analyzer is not suitable for any kind of verification. It can be a decent bug finding tool though.

Regards,
Gábor

Note that the analyzer do not reason about global variables right now.

@Gábor: Hmm, what do you mean? :o They're present in the Store and work like all other variables, they're just invalidated too often (on every unmodeled function call). If the variables are also const-qualified, then they shouldn't be invalidated, and should always resolve to their initial value (though i think there were some bugs there).

@Divya: if you think that your own API functions themselves do unnecessary invalidation (rather than user-defined functions or library functions), then you have an option to `evalCall` them - that's a special checker callback in which you can take care of all modeling, but

And also not that there are no guarantees about the coverage. Therr might be code that is not covered by the analysis at all.

@Gábor: Yeah, it might be that as well. The loop might have been to complex, and the analyzer didn't find the proper path through the loop (loops are currently inlined as well.

@Divya: you may want to increase the `-cc1 -analyzer-max-loop=4` option to a higher value). In the worst case, i'd have had a look at the ExplodedGraph (http://clang-analyzer.llvm.org/checker_dev_manual.html#visualizing) to see what exactly is going on.

It might also easily be something else, so if you can post some sample code, we'd probably make a better guess.

Hi Artem,

I’m not sure what the protocol is for posting code here. I was trying to abstract the behaviour of a well-known

in memory key-value store into the following program so this may be too much code to post here. Let me know
if you want me to give you an even more abstract version. And again, thanks for your help!

/----------- test.c -------------/

#include<stdio.h>

#include<string.h>

#include<stdlib.h>

typedef struct _item{

int key;

char value;

struct _item * next;

} item;

void taint_add(item ** it);

item * global_item_list = NULL;

item * alloc_item(int key, char value) {

item * new_item = malloc(sizeof(item));

new_item->key = key;

new_item->value = value;

new_item->next = NULL;

return new_item;

}

item * get_item(int key) {

item * list = global_item_list; ---- (D)

while(list) {

if(key == list->key)

return list;

list = list->next;

}

return NULL;

}

void put_item(item * new_item) {

if(!global_item_list) {

global_item_list = new_item; ----- (C)

}

else {

new_item->next = global_item_list;

global_item_list = new_item;

}

return;

}

int dispatch_op(char * command_str) {

char *token;

const char s[2] = " ";

token = strtok(command_str, s);

int key;

char value;

item * it;

if(token == NULL) {

printf (“[Error] No command entered.”

“Please enter a valid command\n”);

return 1;

}

if(!strcmp(token, “SET”)) {

/* Grab the key */

int key;

token = strtok(NULL, s);

if( token == NULL ) {

printf (“[Error] Invalid number of arguments.”

“Please reenter command with correct params\n”);

return 1;

} else {

key = atoi(token);

if(key == 0) {

printf(“[Error] Key must be greater than 0\n”);

return 1;

}

}

/* Grab the value */

char value;

token = strtok(NULL, s);

if( token == NULL ) {

printf (“[Error] Invalid number of arguments.”

“Please reenter command with correct params\n”);

return 1;

} else {

value = *token;

}

item * it = alloc_item(key,value);

taint_add(&it); ---- (A)

put_item(it); ----- (B)

printf(“[Success] Item added.\n”);

return 1;

} else if (!strcmp(token,“GET”)) {

/* Grab the key */

int key;

token = strtok(NULL, s);

if( token == NULL ) {

printf (“[Error] Invalid number of arguments.”

“Please reenter command with correct params\n”);

return 1;

} else {

key = atoi(token);

}

item *it = get_item(key); — (E)

if(it != NULL)

printf(“[Success] Item Found: Key %d, Value %c\n”, it->key, it->value);

else printf(“[Failure] Item Not found.\n”);

return 1;

} else if (!strcmp(token, “QUIT”)) {

printf(“GoodBye.\n”);

return 0;

}

else {

printf(“[Error] Unknown command: Enter a valid command\n”);

return 1;

}

}

/* Command are GET, SET, QUIT */

void process_command() {

char *buffer = NULL;

int read;

unsigned int len;

do {

do {

printf("Please Enter a command: "

“GET key::int(>0) | SET key::int(>0) val::char | LIST | QUIT\n”);

read = getline(&buffer, &len, stdin);

buffer[strcspn(buffer, “\n”)] = 0;

} while (-1 == read);

} while(dispatch_op(buffer));

return;

}

int main(int argc, char ** argv){

/* item * A = alloc_item(1,‘A’);

item * B = alloc_item(2,‘B’);

taint_add(&B); – (P)

put_item(A);

put_item(B);

item * it = get_item(2); – (Q)

printf(“Value of %d is %c\n”, it->key, it->value); --(R) */

process_command();

return 0;

}

I run the analysis using the command: /llvm-install/bin/scan-build -k -enable-checker alpha.security.taint.TaintPropagation -enable-checker debug.TaintTest clang -cc1 -analyzer-max-loop=10 test.c
I have added the function taint_add() as one of the sources of taint inside GenericTaintChecker. I was expecting to see the taint flow from A → B → C → D → E highlighted above with the reasoning that
the dispatch_op() function can multiplex between SET(put_item) and GET(get_item) requests and taint introduced though the SET request can flow to global variable and then the result of the PUT request
but this doesn’t happen.

However, if I uncomment the highlighted commented lines in main, I can see that taint flow from P → Q → R.

Woohoo, i found something. The taint starts propagating correctly as soon as I add

     global_item_list = NULL;

to the beginning of main().

It is true that the analyzer doesn't assume globals are initialized to their initializers. But i never noticed that it is so even when the analysis starts from main(). I think it'd be a nice feature to improve upon (not hard, but will need some coding - in RegionStore, add a special mode that completely changes its behavior on denoting non-bindings with symbols whenever we think the analysis starts from the very beginning of the program; probably assume C++ global constructors fire at that time as well, but for that scenario we'd only be able to support static globals).

Will keep investigating, because i don't feel as if i understand what's going on yet.

All right, i see. Because the analyzer failed to initialize the global, upon analyzing get_item() as first command, it fails to inline get_item() because of the potentially infinite loop inside it. Then, upon analyzing get_item() after put_item(), it recalls that get_item() is too complex to inline, and skips the call as if no body is available (models the call "conservatively"). The return value of get_item() is conjured up, and therefore carries no taint.

So, long story short, this code is already too complex for our analyzer. Our default options are tweaked for maximum bugs-per-second in general case, but maybe we could make an option to analyze deeply, no matter how much time it takes.

For the reference, here's my test.c file and the way i patched GenericTaintChecker when tried to mimic your approach. I run it with debug.ExprInspection and without debug.TaintGeneric and produce a trimmed exploded graph with -analyzer-viz-egraph-graphviz -trim-egraph (the last option trims the exploded graph to keep only the path to the warning, which is in our case the debug.ExprInspection warning that says that the value we're analyzing is conjured rather than modeled properly; i added an extra variable to reduce the possible paths).

test.c (3.33 KB)

GenericTaint.patch (1.65 KB)

Hi Artem,

The attached patch highlights the code responsible for conservative replay without inlining. It's not very complex, and i guess we could make an option for tweaking this particular behavior.

DisableConservativeWhenInliningFails.patch (2.38 KB)

Hi Artem,

Thanks for sending that – I commented out those lines in the source and recompiled and the taint propagates correctly for the example I sent you.
But then I changed the code slightly, so that instead of storing the items in a list, I store them in an array. New items are added to an offset of the
array address (i.e., array[OFFSET] = new_item). Now the taint doesn’t propagate at all. Is there something about array modelling that is different?

I have attached my revised test case with this email. Note that if I replace A1 with A2 and B1 with B2 or hard code the offset into the array
the taint propagates correctly.

Thanks,
Divya

test.c (3.21 KB)

Hmm, that's because your globals are invalidated on every conservatively evaluated call, including taint_add(). You might need to evalCall taint_add(). To emulate how it would work, you can replace the taint_add() prototype with:

   void taint_add(item **it) {}

(empty body, but post-call still happens, so GenericTaint still operates correctly).

DoNotInvalidateOnBuiltins.patch (619 Bytes)