Hi, everyone. This is the GSoC proposal for the Clang Static Analyzer. Please let me know if you have any feedback. Thank you!
Project Description
Name: [Clang] Improve and stabilize the static analyzer’s “taint analysis” checks
Summary: The Clang static analyzer is a source code analysis tool included in the Clang project and Xcode. While its results are accurate on simple test cases, false positives and false negatives become more likely on larger programs. The goal of this project is to improve the static analysis provided by the Clang static analyzer, focusing especially on the experimental taint analysis. The analyzer will be exercised on various test cases and real-world programs.
Personal Information
Name: Juhee Kim
Email: kimjuhi96@snu.ac.kr
Location: Seoul, South Korea (UTC+9)
Affiliation: Seoul National University
Goals
I found several issues described below when using the up-to-date Clang static analyzer with a few toy and real-world programs. In this project, I want to focus on addressing these issues.
A. Enhancing taint analysis
I discovered several false negatives in the taint analysis that do not seem too difficult to fix.
A-1. Memcpy of partially tainted objects.
When a partially tainted object (e.g., struct) is copied to another object through a memcpy-like library call (e.g., memcpy, strcpy), the taint is not propagated properly.
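A minimal sketch of the situation (source() and sink() are hypothetical functions assumed to be configured as a taint source and sink, matching the convention used in the other examples):

#include <cstring>

char source();   // hypothetical taint source
void sink(char); // hypothetical taint sink

struct Obj {
  char s[20];
  int y;
};

void test() {
  Obj a, b;
  a.s[0] = source();            // only the field 's' of 'a' is tainted
  memcpy(&b, &a, sizeof(Obj));  // byte-wise copy of the whole object
  sink(b.s[0]);                 // not detected: the taint is lost in the copy
}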
A-2. Taint propagation through pointer-integer casts.
When the data flow from the source to the sink contains either a pointer-to-integer cast or an integer-to-pointer cast, the taint is not propagated.
char *ptr = source();
sink((long long) ptr); // not detected!
long long val = source();
sink((void*) val); // not detected!
A-3. Unsafe data flow whose sink overlaps with a safe data flow.
When a safe and an unsafe data flow share a sink at the same program point, and they follow different control paths (different function call sites), the unsafe data flow is not detected.
int global = 0;

void sink_global() {
  sink(global);
}

void set_global(int a) {
  global = a;
}

int main() {
  int a = source();
  sink_global();
  set_global(a);
  sink_global(); // not detected!
}
This behavior is odd, since the unsafe data flow is detected when the control paths differ only by a branch:
int a = source();
int b;
if (a == 0) {
  b = 1;
} else {
  b = a;
}
sink(b); // unsafe data-flow detected!
B. Enhancing null-pointer dereference analysis with taint analysis
I discovered common false positives in the null-pointer dereference analysis on real-world programs. When a null check on a pointer that is in fact non-null is followed by a pointer dereference without a null check, the null-dereference analysis warns about the memory access.
int foo(int a, char **p) {
  if (a > 0) {
    *p = (char *)malloc(a);
    return a;
  } else {
    *p = nullptr;
    return 0;
  }
}

int main() {
  ...
  foo(a, &p);
  if (a == 0) {
    return 1; // p is null
  }
  if (p &&          // unnecessary null check (p is always non-null)
      p[a-1] == '0') {
    len = strlen(p);
  }
  char x = p[0]; // warning: null-dereference!
}
We may be able to reduce this kind of false positive with taint analysis. We can treat the possible null pointers as sources and the pointer used in the memory dereference as a sink. If there is no data flow from any possible null pointer to the dereferenced pointer, the memory dereference is not a null dereference. If the set of possible null pointers is too large, we may try to reduce it with a reachability analysis.
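A rough sketch of the intended distinction (hypothetical code; the exact heuristic is still an open question):

char *source_of_possible_null();  // hypothetical: e.g. a malloc() wrapper that may return null

void example() {
  char local[8] = "";
  char *safe = local;                      // can never be null
  char *maybe = source_of_possible_null(); // possible-null "source"

  char a = *safe;   // no data flow from any possible-null pointer reaches 'safe',
                    // so under this scheme the dereference would not be reported
  char b = *maybe;  // data flow from a possible-null pointer exists,
                    // so a potential null dereference is still worth reporting
}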
C. Expanding the analysis scope of taint analysis
By default, Clang static analysis is performed on a single translation unit (a file). Given the possibility of unsafe data flows across translation units, a reliable taint analysis should examine such data flows. With the available cross-translation unit (CTU) analysis tools (2.1. Cross Translation Unit (CTU) Analysis — Clang 17.0.0git documentation), we can apply taint analysis across multiple translation units, discover any issues, and fix them if possible.
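For example, a flow like the following sketch crosses a TU boundary and is invisible to single-TU analysis (the file names and the handle() helper are made up for illustration):

// a.cpp
#include <cstdio>
void handle(char *cmd);            // defined in b.cpp
void read_and_handle() {
  char buf[64];
  fgets(buf, sizeof(buf), stdin);  // taint source
  handle(buf);                     // the tainted data leaves this TU
}

// b.cpp
#include <cstdlib>
void handle(char *cmd) {
  system(cmd);                     // taint sink; analyzing b.cpp alone, the
                                   // analyzer cannot know that 'cmd'
                                   // originates from fgets()
}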
Thanks for showing us the draft proposal and the existing issues.
Since you didn’t give compilable example code for the first two kinds of issues, I tried to reproduce them with my own toy code.
#include <cstdio>
#include <cstdlib>
#include <cstring>
using namespace std;

struct Obj {
  long long x = 0;
  char s[20];
  int y = 0;
  float f = 0.;
};

int main() {
  Obj obj1, obj2;

  // memcpy of partially tainted objects, fail to detect
  fgets(obj1.s, 20, stdin);
  memcpy(&obj2, &obj1, sizeof(Obj));
  system(obj2.s);

  // cast char * to long long, I think there is no taint propagation
  float div0 = 1 / (long long)obj1.s;
  printf("address of obj1.s: %lld\n", (long long)obj1.s);
  printf("%f\n", div0);

  scanf("%lld", &obj1.x);
  // cast long long to char *, cannot detect
  system((char *)(obj1.x));
  return 0;
}
I built the example code with scan-build -enable-checker alpha.security.taint.TaintPropagation -V clang++ filename.cpp. It reports that no bugs are found, so I could reproduce both kinds of issues.
However, for the type-cast issue, I don’t think a cast from the source pointer to an integer should propagate the taint, because the integer will only contain the address of the source pointer.
It seems that in your example the data flow from scanf to system is not detected, even though it doesn’t make much sense to cast an integer to a pointer there. In fact, it is a serious security bug, since untrusted user input can control the argument of the system call.
I think the taint analyzer should propagate taint across integer-pointer casts. Since Clang taint analysis supports user-defined source/sink configuration, a user may want to track a pointer returned from a source, and a pointer-to-integer cast may well occur along the way (for pointer arithmetic, etc.). Even if the program casts a pointer to an integer, or vice versa, for no apparent reason, it is better to keep tracking the taint.
Hi, looks like you’ve already explored it quite a bit, very nice! That’s a lot of interesting examples to cover.
It’s pretty likely that we’ll see a variety of defects of this kind. What we’ll probably have to do is ask ourselves which ones are critical (the feature can’t be declared stable for everyday use without fixing them) and which ones aren’t (the feature can certainly be improved further, but it’s already useful as-is).
The static analyzer isn’t a “verification” tool; it’s a “bug-finding” tool. Basically this means that our users aren’t interested in spending a year to fix all 20 bugs in their code and have a proof that no bugs remain (something you’d love to do if your software is controlling, say, a Mars rover), but they’re interested in spending an hour to fix 5 of those bugs, even if there are still 15 left (something you’d love to do if you’re making a mobile connect-three game, when the cost of mistake is very low).
So our highest priority is eliminating false positives (when the tool emits a warning, but the code under analysis does not need fixing). False positives consume the user’s time without giving anything in return, making the tool costly to use. They may also cause the user to misunderstand the code and introduce more bugs while trying to address the warnings.
And we’re usually paying less attention to false negatives (when the code under analysis needs fixing, but the tool doesn’t emit a warning). Yes, false negatives mean we miss bugs, but the users are still grateful for the ones that we do find, and the tool remains cheap to use.
Even though this entire project is about addressing false negatives (“the tool currently doesn’t warn about attacker-controlled inputs passed to sensitive functions, but it arguably should”), our obligation to our users is that we should not introduce false positives while fixing false negatives. Sometimes it’s inevitable, but generally we shouldn’t introduce too many false positives, and ideally we shouldn’t introduce them knowingly.
So in order to declare taint checkers stable, we will, first and foremost, focus on making sure that the new checker doesn’t have false positives (or doesn’t cause other checkers to emit false positives, because all checkers are interconnected).
In this sense, items A-1 … A-3 in your report are basically false negatives (like you said, “not detected!”). We may have time to look into them, but that’s not going to be our primary objective. I’m really curious whether you’ve found any false positives, any incorrect reports, in your investigation. Such as:
1. The analyzer thinks that data is attacker-controlled, but it’s actually not attacker-controlled;
2. The analyzer thinks that data is passed to a sensitive function, but the function isn’t actually sensitive;
3. The analyzer thinks that data is unsanitized, but it’s actually sufficiently sanitized;
4. The analyzer thinks that attacker-controlled data travels from one variable to another throughout the execution path in a certain way, but it actually travels in a different way;
5. The analyzer thinks it has found a potential execution path with a vulnerability, but such an execution path is impossible at runtime for some reason;
6. The analyzer is correct that unsanitized attacker-controlled data is passed into a sensitive function on the very real execution path it explored and described in notes, but it’s not describing it well enough for the user to see the problem from the warning;
and so on. This is what we’re after. I’m particularly worried about (3), as I’ve seen a few of these myself last time I tried to run it.
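For instance, a hypothetical sketch of category (3) could look like this: the data is tainted, but a bounds check sanitizes it before use, so a checker that failed to model the check would warn spuriously.

#include <cstdio>
#include <cstdlib>

int main() {
  int n = 0;
  if (scanf("%d", &n) != 1)       // 'n' is attacker-controlled (taint source)
    return 1;
  if (n < 1 || n > 4096)          // sanitization: 'n' is now within safe bounds
    return 1;
  char *buf = (char *)malloc(n);  // a warning about a tainted allocation size
                                  // here would be a false positive
  free(buf);
  return 0;
}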
In the general case, this is the desired behavior, i.e. a true positive. If the pointer can be null, then your code is crashing. If the pointer is never null, then the null check is dead code, which isn’t a crash but it is a different class of problems with the code that our tool also detects. The analyzer’s warnings are (ideally) worded in such a way that it’s clear that it could go either way (godbolt):
<source>:19:7: note: Assuming 'p' is null
if (p &&
^
<source>:22:12: note: Array access (from variable 'p') results in a null pointer dereference
char x = p[0];
^
So it’s not saying “this code has null dereference”. It’s saying “this code has null dereference assuming ‘p’ was null three lines above”. Which may be a “vacuous truth” if the assumption is always invalid, but even in this case it can be pointed out that the code does not make sense and needs fixing - just for a different reason. So we’ve obtained a solid proof that the code needs fixing, and correctly presented that proof to the user.
And the analyzer doesn’t help you decide what’s the right fix, it simply explains the problem. And now you see why it fundamentally cannot help you decide that. Unlike linter tools that focus on simple one-liner problems, the analyzer is great specifically because it finds problems that connect multiple lines of code together, these problems cannot be discovered by looking at the same lines of code in isolation from each other. So inevitably, a change in one individual line of code alone may eliminate the entire problem. In our case, either remove the dereference, or remove the check. Or do something else entirely, like add an assert somewhere in the middle. Or refactor your entire project to avoid such ambiguities. You’re the author, you do you.
Now, to your example specifically: I still believe this is a true positive and a more serious issue than just “dead code”. I think what you’re missing is that malloc() is not guaranteed to return a non-null pointer. It is allowed to return null under certain circumstances (say, your machine ran out of memory). Different platforms behave differently in this regard, and many people actively choose to ignore this problem, so the analyzer wouldn’t warn you if you dereference a pointer produced by malloc() without the check. But since you do have a check in your code, it’s a good indication that you’re interested in covering this possibility. Notice how the warning disappears if you add assert(*p) to make sure the freshly allocated pointer definitely can’t be null (godbolt). The warning also disappears when you use operator new which, unlike malloc(), cannot return null (godbolt). So the serious issue that the analyzer has correctly identified is, “the code does not properly check the return value of malloc, even though it clearly attempts to”. Which is, as promised, more serious than just “dead code”.
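For reference, here is a sketch of that change, applied to the foo() from the proposal (using assert() from <cassert>; per the godbolt link above, this is enough to make the warning disappear):

#include <cassert>
#include <cstdlib>

int foo(int a, char **p) {
  if (a > 0) {
    *p = (char *)malloc(a);
    assert(*p && "allocation failed");  // documents that a null result is not handled here
    return a;
  } else {
    *p = nullptr;
    return 0;
  }
}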
Also note that !(a > 0) and !(a == 0) aren’t mutually exclusive (a negative 'a' satisfies both), so I had to change one of them in my examples. If I change it back, the warning correctly comes back, even in the safe case (godbolt). And notice how the notes are different:
<source>:9:6: note: Null pointer value stored to 'p'
*p = nullptr;
^~~~~~~~~~~~
<source>:20:7: note: 'p' is null
if (p &&
^
<source>:23:8: note: Array access (from variable 'p') results in a null pointer dereference
p[0] = 1; // warning: null-dereference!
~ ^
There’s no “Assuming” anymore! The analyzer now knows that on the execution path it’s describing, p is definitely null. That’s a different kind of defect, the analyzer demonstrates that we’ve literally assigned nullptr to a memory location and then proceeded to dereference it, bypassing all checks. It doesn’t rely on any assumptions, but it still connects multiple lines of code together: the one with assignment and the one with dereference.
Now, can this be a problem in some other cases? Yes, absolutely, and that’s a pretty fundamental problem with our analysis technique. Just because there’s a check in the code doesn’t mean that the check is there for this specific purpose. It could be checking for a completely different possibility that the analyzer didn’t foresee, because the compiler doesn’t always have perfect information about the entire program it’s compiling. So this could lead to false positives, and some of them could be pretty embarrassing. We generally want to do something about this problem, and we already do a lot of things to mitigate it.
But it’s also a very difficult problem. Even if we make a perfectly precise fix that correctly separates positives backed by a sufficient proof from positives where there’s not enough evidence that the code makes no sense, we may end up with a completely different static analyzer, which is not necessarily much more practical than what we already have. This is because a lot of these bad positives are actually not that bad: some of them are true positives (just because we can’t prove that the path exists doesn’t mean it doesn’t), and even when they’re false they still often encourage people to make their code easier to understand and to document the underlying contracts with assertions; some users find this valuable even when no underlying problem was ultimately found. Just like with dead code, a lot of these can be seen as just another, less severe problem that our tool highlights. Others are completely terrible though, so we definitely want some improvements in this area, and we definitely don’t want to make them more common than they already are.
So it’s probably going to be out of scope for this summer. Even if we see some false positives of this kind, we’ll probably ignore them, unless we have reasons to believe that our checker is affected by them disproportionally to the rest of the tool.
Please also see my reply in a nearby thread, it’s very relevant to this conversation.
A quick note about this one as well, I think this one’s also pretty difficult to fix! This basically goes way back to bug #43459. In fact, last year during GSoC our student @isuckatcs (often criticized for certain inaccuracy in the username) actually got pretty close to proposing a somewhat viable approach (link in the last comment in bug tracker) but we didn’t have nearly enough time to properly debug and test it. There’s also a somewhat related discussion in an old thread.
Hi @NoQ, I really appreciate your feedback. I can clearly see where this project is heading.
I ran taint analysis on several well-known projects, such as nginx and openssl, to look for false positives. However, it did not raise any warnings on the projects I’ve tested so far.
Maybe this is because data flows between the default sources and sinks are not common, even though, when they do occur, they indicate unsafe data flow in the majority of cases. (Perhaps another problem is that the translation units provided to the analyzer are too small compared to the entire project.)
If this project is to test taint analysis on real-world programs, providing custom source and sink sets for each project would be helpful. Are you considering customizing the source/sink sets for each target? If so, would selecting the right sources and sinks (ones where we can say that every data flow from the source to the sink is unsafe) be one of the challenges? Or are we looking for any false positives in the data flow from the given sources to the sinks?
I’m going to regenerate my own data - I have the ability to use resources at work to run analysis on a lot of code. But that will probably take a week or two, and even when I do, I probably won’t be able to share most of it with anybody, because it deals with proprietary code. I may be able to identify occasional open-source projects that are useful to us. But you definitely don’t have to wait on me. We’re going to be much more productive if you’re able to gather data yourself. So let’s see what we can do! By “we” I mean, I encourage you to try that, and I’ll probably do that myself as well, but I should probably focus on the proprietary code thing first, and we both probably have other stuff to do, so I guess whoever gets there first!
Our goal is, first and foremost, to make sure the default, unconfigured behavior is sane. We can add more taint sources specific to projects in order to test our tool better, but this use pattern (even if it’s ultimately more useful than the default) isn’t our primary goal. We need to make sure the analysis is correct first.
So I think it would be ideal if we found a few open-source projects that use the default taint sources a lot. It might help to search github for uses of these specific functions.
If this fails, we should try expanding the list of taint sources supported by default. There’s probably a lot of standard and “semi-standard” functions that a lot of people use, which we can safely cover.
In the worst case, they arguably don’t even need to be actual taint sources; we can still test that the rest of the machinery works correctly even when completely random functions are annotated as taint sources. But of course it’d be much better if they were actual attacker-controlled taint sources.
To add to this, you might wonder how come I posted a project about solving some problems and I don’t even know what these problems are; and that’s like, a really valid question! While I’m 50% confident we’ll find quite a few bugs to fix (and if we don’t find nearly enough of them, we’ll still have a lot of fun implementing a couple new tasty checks or fighting other false negatives) I could totally have spent more time preparing some data ahead of time.
I didn’t want to, though. I think it’s actually a good idea to leave a project more exploratory and have less of “I tell you what to do and you do it”. As a maintainer, not all my work is programming; I have to do all sorts of work to make sure the project is healthy, which may involve programming, but sometimes it involves setting up testing infrastructure, sometimes it’s about gathering data and manual quality assurance, sometimes you document things, sometimes you talk to other users of the tool about their experience and process bug reports. Real-life problems do tend to be exploratory like that, and I’m kinda happy to offer realistic industrial experience during GSoC ^.^
Like, just because we don’t have any known false positives in the checker, doesn’t mean we’re free to enable it by default right this second; the other alternative is that we didn’t look hard enough. Static analysis is fascinating because the input data is so diverse that there’s almost never enough of it. Every codebase is unique, people write code differently, use different patterns and idioms. You can run analysis for days on a variety of projects and you don’t get nearly enough good signal from that. Of course this doesn’t mean we can’t move forward without spending months on testing, but it also doesn’t mean we can give up and not test anything.
So objectively speaking, it’s absolutely clear that some work needs to be done. It may boil down to confirming that everything’s good, but the work of confirming is still non-trivial work that can easily consume 3 – 4 weeks.
Note that just because I haven’t provided all this data, doesn’t mean you have to gather it all by yourself before submitting your application. You probably shouldn’t prioritize this over your university! Ideally you should have some idea of what you’ll be doing during summer, but “gather data through this and that method” is a valid answer to that question. Of course if you do manage to gather some data ahead of time, it’d give you a much clearer idea of what would happen during the summer, and it may give you “competitive advantage” if multiple students apply (unfortunately GSoC has to be competitive like that, I always wish I could accept all the students, but we have to go through selection).
One thing I forgot to mention in the previous message is that we have semi-official test suites! First of all, there’s the official SATest.py docker runner that checks out some projects from github and applies the static analyzer to them inside a docker container, so that the results are the same on every host machine. Then there’s also @Xazax-hun’s csa-testbench, which doesn’t offer docker’s stability guarantees but still contains a lot of projects to look at. It might be that you’ll find more useful code to analyze in one of these, or even expand them with new projects you’ve found!