[Clang] Improve and stabilize the static analyzer's "taint analysis" checks

The Clang static analyzer comes with an experimental implementation of taint analysis, a security-oriented analysis technique built to warn the user about flow of attacker-controlled (“tainted”) data into sensitive functions that may behave in unexpected and dangerous ways if the attacker is able to forge the right input. The programmer can address such warnings by properly “sanitizing” the tainted data in order to eliminate these dangerous inputs. A common example of a problem that can be caught this way is SQL injection. A much simpler example, which is arguably much more relevant to users of Clang, is buffer overflow vulnerabilities caused by attacker-controlled numbers used as loop bounds while iterating over stack or heap arrays, or passed as arguments to low-level buffer-manipulating functions such as memcpy().
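To make the memcpy() case concrete, here is a minimal C sketch of what “sanitizing” an attacker-controlled size could look like. This is not analyzer code: `copy_untrusted` and `BUF_SIZE` are hypothetical names invented for this illustration, and the assumption is simply that `len` arrives from an untrusted source such as a network packet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 16 }; /* hypothetical destination capacity */

/* Copies `len` bytes from `src` into a destination of BUF_SIZE bytes.
 * `len` is assumed to be attacker-controlled ("tainted"), so it is
 * range-checked before it reaches memcpy(), the sensitive "sink".
 * Returns 0 on success, -1 if the request is rejected. */
int copy_untrusted(char *dst, const char *src, size_t len) {
    assert(dst != NULL && src != NULL);
    if (len > BUF_SIZE) {
        return -1; /* sanitization: refuse sizes that would overflow dst */
    }
    memcpy(dst, src, len);
    return 0;
}
```

Without the `len > BUF_SIZE` check, the same call would be the kind of tainted-size flow into memcpy() that the project description is talking about.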

Being a static symbolic execution engine, the static analyzer implements taint analysis by simply maintaining a list of “symbols” (named unknown numeric values) that were obtained from known taint sources during the symbolic simulation. Such symbols are then treated as potentially taking arbitrary concrete values, as opposed to the general case of taking an unknown subset of possible values. For example, division by an unchecked unknown value doesn’t necessarily warrant a division by zero warning, because it’s typically not known whether the value can be zero or not. However, division by an unchecked tainted value does immediately warrant a division by zero warning, because the attacker is free to pass zero as an input. Therefore the static analyzer’s taint infrastructure consists of several parts: there is a mechanism for keeping track of tainted symbols in the symbolic program state, there is a way to define new sources of taint, and a few path-sensitive checks were taught to consume taint information to emit additional warnings (like the division by zero checker), acting as taint “sinks” and defining checker-specific “sanitization” conditions.
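The division example above can be sketched as a toy model in C. This is only an illustration of the reasoning, not how the analyzer is implemented: `ToyState`, `mark_tainted`, `assume_nonzero` and `check_division` are all hypothetical names, and symbols are reduced to small integer ids.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_SYMBOLS = 32 }; /* hypothetical cap on symbol ids */

/* Toy "program state": which symbols are tainted, and which have been
 * proven non-zero by a check on the current path. */
typedef struct {
    bool tainted[MAX_SYMBOLS];
    bool known_nonzero[MAX_SYMBOLS];
} ToyState;

typedef enum { NO_WARNING, WARN_DIV_BY_ZERO } Verdict;

static void check_id(int sym) { assert(sym >= 0 && sym < MAX_SYMBOLS); }

/* A taint source produced this symbol. */
void mark_tainted(ToyState *st, int sym) {
    check_id(sym);
    st->tainted[sym] = true;
}

/* The path passed through a check such as `if (x != 0)`. */
void assume_nonzero(ToyState *st, int sym) {
    check_id(sym);
    st->known_nonzero[sym] = true;
}

/* The "sink": decide whether dividing by `divisor` deserves a warning. */
Verdict check_division(const ToyState *st, int divisor) {
    check_id(divisor);
    if (st->known_nonzero[divisor]) {
        return NO_WARNING; /* sanitized on this path */
    }
    /* Unknown but untainted: not enough evidence to warn.
     * Unchecked and tainted: the attacker is free to pass zero. */
    return st->tainted[divisor] ? WARN_DIV_BY_ZERO : NO_WARNING;
}
```

In this model the same unchecked divisor goes from “no warning” to “warning” purely because it was marked tainted, and back to “no warning” once a non-zero check is recorded.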

The entire facility is flagged as experimental: it’s basically a proof-of-concept implementation. It’s likely that it can be made to work really well, but it needs to go through some quality control by running it on real-world source code, and a number of bugs need to be addressed, especially in individual checks, before we can declare it stable. Additionally, the tastiest check of them all – buffer overflow detection based on tainted loop bounds or size parameters – was never implemented. There is also a related check for array access with tainted index – which is, again, experimental; let’s see if we can declare this one stable as well!

Expected result: A number of taint-related checks either enabled by default for all users of the static analyzer, or available as opt-in for users who care about security. They’re confirmed to have low false positive rate on real-world code. Hopefully, the buffer overflow check is one of them.

Desirable skills: Intermediate C++ to be able to understand LLVM code. We’ll run our analysis on some plain C code as well. Some background in compilers or security is welcome but not strictly necessary.

Project size: Either medium or large.

Difficulty: Medium

Confirmed Mentors: Artem Dergachev, Gabor Horvath, Kristóf Umann, Ziqing Luo.

Greetings,

I’m interested in working on this project. I have some experience with C and C++ but none with compilers. Should I begin with the LLVM kaleidoscope tutorial?

Hi,
I’m interested in this project. I have some basic knowledge of compilers and have experience with C++. Although I don’t have experience with LLVM & Clang, I really want to dive in and I think this might be a good place to start. Do I need to check out clang-tidy, or is there somewhere else I should start?

Hi,
I am a final-year undergraduate majoring in cybersecurity and a prospective PhD student in software security and program analysis. It seems the project idea fits my research interests well. I have used CSA, Clang-Tidy, and LLVM IR in my previous work. I will look closer into the taint analysis checks.

Hi, I am interested in working on this project. I am a third-year CS student. I have some working experience in C/C++ but not in compilers. Could you suggest where to get started?

@harsh @yichi170 @chinggg @Xenon1019 thank you for your interest!

I recommend diving directly into the clang static analyzer.

General LLVM tutorials such as Kaleidoscope are excellent but they don’t deal with anything that we’ll be working with here. In particular, the clang static analyzer doesn’t interact with LLVM itself; it interrupts the clang pipeline after AST to work directly with that, long before it is converted into unoptimized LLVM IR. Instead, the static analyzer builds its own “Clang CFG” which serves a similar purpose to LLVM IR but represents the original source code much more closely.

Good entry points for the Static Analyzer specifically are:

Another thing you can try is, try to actually use the static analyzer. Like, apply it to code you’re already familiar with, see if you like the results. Or at least to any code. While not a requirement by any means, GSoC is really at its best when you’re already a user of the project you’re contributing to.

Hey @NoQ , Shivam this side, been contributing to static analysis tools and in general compilers for a while, I was there in the previous gsoc if you remember :slight_smile: .

Btw, I was exploring the taint analysis code in the static analyzer and found the isTainted method. I’m not sure how complete it is, or whether it can cover all kinds of taint. It seems to me that it covers a lot of ground in terms of detecting taint: it checks whether a given symbol or region is tainted, and traverses all symbols that a given symbol depends on to check whether any of them are tainted. It also checks whether a memory region is tainted.
I think the buffer overflow detection on tainted loop bounds could be implemented? What’s your view on how much work still remains on the isTainted method? Or is it ready enough to start writing checkers for the most common taint checks that would be useful?

Hey, long time no see! Just like the last time, I believe it’s more important to focus on understanding the problems first, otherwise you’re putting the cart before the horse. A lot of the existing code is in the state of “the overall idea is probably correct but there may be bugs, because it wasn’t properly tested”. So we’ll most likely start by looking at some actual warnings on some actual non-toy code, decide whether they’re good or not, and if we discover any common sources of false positives, we’ll address them in order from most common to least common. I don’t know where exactly the fixes will go; they can be all over the place. We’ll probably have a few live debugging sessions over screen-sharing voice chat, to discover the root cause of problems we encounter, though of course it’d be great if you’re somewhat prepared to debug clang ahead of time.

So anyway, the best preparation you can do is, like I said, try to run the tool, try to judge the existing warnings (not even necessarily from the taint checker, could be any static analyzer warnings), understand what they generally look like, what to expect, what’s good about them, what’s bad about them, see whether you like them or not.

Try to get into this mindset: if you were a developer of a security-sensitive project, would these warnings be helpful to you? Is addressing these warnings a good use of your time, or is it better spent, say, addressing actual bug reports from live users? How can we make it easier? Eliminating false positives is one example of such activity: it cuts the cost of reading all the warnings, so it becomes more worthy of people’s time. Adding new high-quality checks is also good because it improves the benefit part, which may outweigh the cost. Another way to reduce the cost of using our tool is to make existing reports, even the correct ones, easier to read and understand; we’re now having a lovely discussion about it in ⚙ D144269 [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place - which is a patch very closely related to this project, something we’d probably have done if folks didn’t do that ahead of us!

Hi.
I’m a Ph.D. student studying software security.
I am familiar with Clang/LLVM. I specifically have experience with LLVM IR passes, sanitizers, tests, frontend, and backend. I’ve tried using the taint analysis of the Clang static analyzer before and found it to be quite hard to use for non-toy programs.

If I want to take part in this project, do I need to submit a proposal in a separate thread?

Yes! This thread is for general questions. Once you have a very rough draft of your GSoC project proposal, you should open your own thread in the Static Analyzer subforum to discuss it with future mentors.

Other than various bugs that we want to address as part of this project, what you’ve probably discovered was that the analysis focuses on small portions of the program at a time. It may have a hard time connecting sources to sinks if they’re spread out across, say, multiple translation units, with many layers of abstraction between them.

My idea with this project is that we’ll ignore this problem, at least initially, but start with making sure that when the analysis does actually connect sources to sinks, it does so correctly.

Then if we have any time left, we can look into this problem. The solution most likely depends on the codebase we’re analyzing; for some codebases we’ll never do a good job without proper “whole-program” analysis (for which our underlying analysis technique is really not ideal), but for others it’s most likely a matter of providing annotations (clang attributes) to mark up taint sources and sinks in the code, so that our focus on small portions of the program is actually sufficient. So we can try to make sure that such annotations are suitable for at least some practical use cases.

I totally agree with you. I believe a working taint analyzer (and any other compiler analysis) should be fast enough to be practical, even if that means it misses some issues, but whatever it does report should be correct.

Thank you for letting me know how to apply. I’ll open a thread there!

I made a large-ish reply to @juppytt’s new thread, which is probably relevant to everybody who wants to apply; I’m talking about our priorities with this project, namely that our primary priority is to solve false positives in the new checker, whereas addressing false negatives has lower priority.

There’s also an interesting piece of discussion in a nearby thread which is probably not relevant right away, but it could shed some light on future directions, and get you excited about the impact of this project!

Hello! :smiley:
I am a third-year undergraduate CS student and I am very interested in Clang/LLVM and compilers. I have written my own toy compiler for a subset of C (using LLVM IR), and I am currently doing a corporate internship where I work on a project about CSA checkers related to taint analysis.
I would really like to take part in this project for GSoC 2023, but I don’t have experience contributing to an open-source community yet. Is this necessary for the application?

No not really, it’s more like the opposite, the whole point of GSoC is to attract more new people to various open-source communities. So you’re like, exactly the target audience!

I just remembered an inspiring link about how the Heartbleed exploit wouldn’t have happened if we did this project ten years ago: Using Static Analysis and Clang To Find Heartbleed | Trail of Bits Blog

Well, better late than never (shrug).

Hi!

I am one of the confirmed mentors for this project. Unfortunately, I have to announce that I’m withdrawing this year for personal reasons. I am not leaving the static analyzer, and I’m still interested in how this year is going to turn out, I just simply can’t take on additional responsibilities right now. Hopefully, I can still contribute to some extent!

This shouldn’t discourage anyone: @NoQ and @Xazax-hun have been mentoring static analyzer GSoC projects for many, many years now (including mine!), and hopefully will for many more to come!

Good luck with your proposals!

Best of luck with your personal reasons @Szelethus!! You’re always welcome to join our discussions ^.^

Two things.

  1. The deadline for applying to the project is coming up rapidly - it’s next Tuesday! (Please mind time zones!) By that point your final project proposal has to be uploaded to the GSoC website.

    • There’s still time to discuss it before uploading, and it’s still valuable to do so. We won’t have time to change much, but if there are any large misunderstandings we’ll be able to correct them ahead of time, so it’s still much better than going in completely blind!
  2. @juppytt mentioned that he’s been having a hard time finding source code to demonstrate warnings of existing checkers, which is something that we’ll eventually have to do in order to demonstrate that the checkers are indeed working correctly. In @juppytt’s thread I promised to gather some data of my own in order to help with that, to make sure that we’re even solving the right problems, and that we’re set up for success doing so.

    • So today I have the data! – I’ve run the existing checkers on a bunch of proprietary code and observed a few hundred warnings.

    • I can’t share the precise data, or the warnings themselves, given that it’s gathered over proprietary code, but I can translate it into high-level conclusions such as “this checker is good”, “this checker has these problems”, etc.

      • So even if we don’t find enough open-source projects to which we can apply the checkers daily, we can still rely on my data.

        • I can repeat this process several times throughout summer as we improve the checkers, to gather the updated data.
        • Of course if you’re able to find some projects to gather data from, it could greatly improve our turnaround times!
    • I summarized my fresh data in a github issue that we can use to track our progress (or split up into smaller issues if necessary – there’s no formal process really): [Umbrella] Enabling taint analysis by default. · Issue #61826 · llvm/llvm-project · GitHub

      • The data has largely confirmed that my initial vision for this project (based on relatively old data) was accurate, so I think that no course-correction is necessary.

        • There are indeed several checkers that fail to confirm that the data is actually “unsanitized” before emitting the warning. We need to teach them how to confirm that.
        • There are also a few checkers that are simply looking for the wrong thing; they flag completely normal, perfectly valid code. We need to see if it’s easy to change them to look for something actually problematic, or they’ll have to stay in alpha until we figure this out.
        • All in all, there aren’t that many bugs to fix. I’m pretty sure we’ll be able to cover all of them and even have time for implementing some new tasty checks!
    • I’m happy to discuss this data in more detail. It’d be great if you try to confirm that you understand the problems we will be facing, and there’s no harm in asking questions if you don’t understand it!
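
To illustrate what “confirming that the data is unsanitized” could mean in practice, here is a small C sketch of an array access with an attacker-controlled index that has already been range-checked. It is a made-up example, not taken from the data above: `lookup_sanitized`, `table` and `TABLE_SIZE` are hypothetical names, and the thread does not say how any particular checker behaves on this exact code.

```c
#include <assert.h>

enum { TABLE_SIZE = 8 }; /* hypothetical table length */
static const int table[TABLE_SIZE] = {10, 11, 12, 13, 14, 15, 16, 17};

/* `index` is assumed to be attacker-controlled ("tainted").
 * The range check below is the sanitization step; out-of-range
 * requests get `fallback` instead of touching the array. */
int lookup_sanitized(long index, int fallback) {
    if (index < 0 || index >= TABLE_SIZE) {
        return fallback;
    }
    /* Invariant on this path: the index is within bounds, so a warning
     * about a tainted array index here would be a false positive. */
    assert(index >= 0 && index < TABLE_SIZE);
    return table[index];
}
```

A check that looks only at “is the index tainted?” would flag the last line; one that also asks “is it still unconstrained on this path?” would not.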

Hi everyone,

I’m interested in contributing to this project. I’m currently in the industry and I have over 6 years of Software Engineering experience, ranging from building the Mina Protocol cryptocurrency, which uses zk-SNARKs to make blockchains constant-size, in OCaml, to building security deployment tools for Apple in Scala. My experience in functional programming energized my interest in static analysis, as I have a huge interest in making programs run safely without runtime errors.

Previously, I’ve built dynamic semantic analysis tools for DataSpell, JetBrains’ Python data science IDE. They essentially gave users recommendations on how to improve their data analysis code, based on the AST of the code and its current execution context.

Now, I would love to contribute by implementing various static analysis checks using “taint analysis” for the Clang repo. I noticed that there is a huge amount of interest in this project (which is amazing!). Would it be possible to have multiple people work on this project? I can even start contributing to this project right now (like 20+ hours per week), well before GSoC starts.