[analyzer][taint] More precise taint modelling on arrays

I made the following test case for checking the modeling of taint propagation on the strcpy function.
As I observed, only the first byte of the array became tainted, even though all bytes should be treated tainted.
In the test, you can see my expectations and the actual result.

void strcpy_unbounded_tainted_buffer(char *buf) {
scanf("%s", buf);

char dst[32];
strcpy(dst, buf); // expected---vvv vvv--- actual
clang_analyzer_isTainted_char(dst[0]); // expected-warning{{YES}} YES
clang_analyzer_isTainted_char(dst[1]); // expected-warning{{YES}} NO
clang_analyzer_isTainted_char(dst[31]); // expected-warning{{YES}} NO
}

void strcpy_bounded_tainted_buffer(char *buf) {
scanf("%s", buf);
buf[10] = '\0';
clang_analyzer_isTainted_char(buf[0]); // expected-warning{{YES}} YES
clang_analyzer_isTainted_char(buf[1]); // expected-warning{{YES}} NO
clang_analyzer_isTainted_char(buf[10]); // expected-warning{{NO}} NO
clang_analyzer_isTainted_char(buf[20]); // expected-warning{{YES}} NO

char dst[32];
strcpy(dst, buf);
clang_analyzer_isTainted_char(dst[0]); // expected-warning{{YES}} YES
clang_analyzer_isTainted_char(dst[1]); // expected-warning{{YES}} NO
clang_analyzer_isTainted_char(dst[10]); // expected-warning{{NO}} NO
clang_analyzer_isTainted_char(dst[20]); // expected-warning{{NO}} NO
}

Some clarification about TaintedSubRegions and tainting nonloc::LazyCompoundVals would be also helpful since it might be related to this topic.

What are the reasons for this limitation on modeling taintedness regarding arrays?

Background and expectation:
This change would be the first step in migrating the diagnostic emitting parts of the GenericTaintChecker.
Eg.: checkUncontrolledFormatString, checkSystemCall, checkTaintedBufferSize.
As a result, multiple checkers will consume taintedness information for reporting warnings in the future and letting the GenericTaintChecker do only modeling and propagation.

Regards, Balazs.

It's not much of a limitation, just a bug. When `scanf` produces a default conjured symbol binding in the buffer, this is *the* symbol that should carry the taint; all derived symbols will automatically be treated as tainted. If we're being even more precise and consider the length of the scanned string, then the derived symbol will need to carry partial taint (i.e., specify TaintedSubRegions such that derived<conj,

is tainted iff R is a TaintedSubRegion of buf for conj). I.e., what

i'm saying is that RegionStore is more-or-less flexible enough to implement this correctly, it's only a matter of getting this done. Those string functions are just too many and each of them requires individual attention.

Thank you for the clarification, Artem.

I’m going to have a look into that.

Artem Dergachev <noqnoqneo@gmail.com> ezt írta (időpont: 2020. febr. 11., K, 17:27):

Artem, it seems to be a similiar problem in this example:

void main(int argc, char **argv) {
char const *inFileName = “input_file.bin”;
FILE *inFile=fopen(inFileName,“r”);
struct stat st;
stat(inFileName, &st);
char *inBuf;
int inBufSize = st.st_size;
inBuf = (char *)malloc(inBufSize);
fread(inBuf,1,inBufSize,inFile);
clang_analyzer_isTainted_char(inBuf[0]); // expected-warning{{YES}} YES
clang_analyzer_isTainted_char(inBuf[1]); // expected-warning{{YES}} NO
}

So is it right to enhance isTainted() API check by checking condition “if 1st element of region is tainted when all elements are tainted”?

I’ll have a look at this next week. I have much more experience now than I did at that time. I should be able to fix this quite quickly.

Please let us know. This improvement will be very useful.

It seems like there are two solutions to this problem.

  1. ‘easy’ & inaccurate
  2. hard & accurate

About the “(1) ‘easy’ & inaccurate” solution:
We just take the base region of the pointee and mark that tainted instead of marking the individual element tainted.

This would be in line with our assumption that the called function might actually access any subobjects part of the enclosing object via fancy pointer arithmetics. However, it would mean that can no longer mark only a subobject tainted.

Consider this example:

int nums[10] = {0}; // zero initialize the whole array
scanf("%d", &nums[1]);
isTainted(nums[0]); // one would expect 'no'
isTainted(nums[1]); // definitely 'yes'
isTainted(nums[2]); // one would expect 'no'

So, by marking the whole base region tainted, all of the elements of the array would be also considered tainted. Since the API description for “scanf” only says that it marks all variadic parameters tainted, we would need to find a way to extend it somehow.

But consider this now:

char str[10] = {0}; // zero initialize the whole array
scanf("%s", &str[1]);
isTainted(str[0]); // one would expect 'no'
isTainted(str[1]); // definitely 'yes'
isTainted(str[2]); // probably 'yes'

This example is pretty much the same, except that now we read a string.
For the same API (scanf()), we expect completely different behavior.
In this case, it would probably make more sense to mark the whole array tainted.
There are other functions, that accept some sort of //length// parameter. Such as this:

read(sock, &nums[1], sizeof(int) * 4); // reads into 'num[1..4]'

In this example, what we would want to express in the API description is that read() will only store 4 integers starting at index 1 of the array nums.
So, we should consider the range nums[1..4] tainted, but nothing more.
One could probably generalize this to support at most one symbolic offset, in the same way as we do with the region store [1].

And this would be the “(2) hard & accurate” solution to the problem.

By looking at the complexity that it would take to implement this, I’m wondering if it’s worth it. After thinking about, all this I would rather stick to marking the whole base region as tainted and live with false-positives that are coming from niche cases. For now, I have this patch file, experimenting with the first idea.
taint-base-regions.patch (5.9 KB)

First, I think we should decide if this is the approach we want to take if any.
WDYT @NoQ
@Xazax-hun @gamesh411

[1] region store

I said ‘easy’ for the first proposed solution, but in fact, I believe it’s not that easy unless you are familiar with LazyCompoundVals, DerivedSymbols, and how invalidation and taint work in the engine. But it’s still easier than the second direction.

We should err on the side of false negatives. I’d rather leave this problem unresolved than go for a solution that assigns taint to too many regions.

I think you already got that right in your experimental patch but I want to make it clear again, that we should not put the taint on regions, ever. Taint is a property of the data, not of the location in which the data is written. Location can be overwritten, a region that previously contained tainted data would then contain regular data. A region can be a symbolic region based on a tainted pointer symbol; in this case when such region is interpreted as a pointer value, it should be a tainted value. But that has nothing to do with the data in that region: even if the pointer is tainted, we can still overwrite it with sanitized data later.

So, again, the simple solution is to find the invalidation marker (conjured symbol) corresponding to scanf and put taint on that symbol. In case of scanf() into the beginning of the base region, we can simply do that. In case of scanf() that doesn’t start at the beginning of the base region, or a read() of limited size, we cannot do that, that’d be incorrect and cause false positives.

Yes, I think the proper solution is to have RegionStore bindings have limited span, so that the invalidation marker spanned only as far as necessary. Then the invalidation marker could carry taint according to its span. We could still go with the solution that I suggested previously, and in fact “partially tainted symbols” have already been implemented, so we could extend that solution with offset ranges to handle the more complicated cases. But we have to solve the same problem in RegionStore anyway and it’d be much more valuable.

Thanks for the response, I think we are on the same page.

Yup, clear. I should have been phrasing it like that indeed. This is what I meant.

Yes, I agree.

I was thinking about that. It is probably the way forward. However, I’m not planning to work on this.

Ah, yes. Very much so. I’d need to think about to see how the store actually relates to the partial taints.
If it’s more about the imposed complexity, then yes. We would be better of with a simpler memory model, which could probably simplify things here on the taint analysis side as well.