[Clang Static Analyzer][GSoC 2025] Teach the Clang Static Analyzer to understand lifetime annotations

The Clang Static Analyzer (CSA) can already find a wide range of temporal memory errors. These checks often have hardcoded knowledge about the behavior of some APIs. For example, the cplusplus.InnerPointer checker knows the semantics of std::string::data. The Clang community introduced some lifetime annotations, including [[clang::lifetimebound]] and [[clang::lifetime_capture_by(X)]], and made many improvements to Clang’s default warnings. Unfortunately, the compiler’s warnings only do statement-local analysis, whereas the CSA is capable of advanced inter-procedural analysis. Generalizing existing checks like cplusplus.InnerPointer could enable the analyzer to find even more errors in annotated code. This can become even more impactful once the standard library gets annotated.

Expected result:

  • Identify the checks that can benefit from the [[clang::lifetimebound]] and [[clang::lifetime_capture_by(X)]] annotations.
  • Extend those checks to support these annotations.
  • Make sure the generated bug reports are high quality: the diagnostics should properly explain how the analyzer took these annotations into account.
  • Validate the results on real world projects.
  • Potentially warn about faulty annotations (stretch goal).

Skills:
Intermediate C++ programming skills and familiarity with basic compiler design concepts are required. Prior experience with Clang or CSA programming is a big plus, but a willingness to learn is also acceptable.


Adding some folks just in case some people are interested in co-mentoring or following in general.

cc, @steakhal, @NoQ, @hokein, @usx95, @gribozavr, @isuckatcs, @DonatNagyE, @Szelethus, @rnkovacs

I’m interested in co-mentoring. I think this would be a good opportunity for me to get a sneak peek into the process of GSoC.

The subject looks reasonable and impactful to me.


Hi, I am interested in this project. Could you please share some documentation or issues related to it?

Wonderful! Thanks a lot, added you as a mentor!

For any applicants, I recommend doing the following:

  • Familiarize yourself with the attributes and what problems they solve. You can find some documentation about them here.
  • Familiarize yourself with the CSA. I recommend taking a look at the official documentation, and also at some talks about the CSA. You can find some here.
  • When you write a proposal, make sure to demonstrate that you have a deep understanding of the problems we want the CSA to find (using your own examples that are not just copied from the documentation).
  • Try to make some simple contributions to LLVM before you submit your proposal to demonstrate that you can work with the codebase (can compile, run tests, format code, open PRs, etc.).
  • Have a plan with a timeline for how you would add this feature to the CSA. It is OK if your plan does not entirely match what we had in mind; the point is to demonstrate that you can come up with some ideas independently.

I also recommend getting feedback on your proposal before submitting it.


Hi, I’m interested in working on this project. I have relevant experience and would love to contribute. Is this the right place to discuss it?

Hi, yes, you’re at the right place.

Hi Gabor, I really want to start making contributions to LLVM, and I was wondering if you could suggest any specific types of open issues suitable for beginners? Are there particular tags I should look for? And how many contributions would be a good amount to demonstrate my involvement? If you could suggest some specific open problems I can work on it would be so helpful. Thank you!

Hi ElioCheng,

I’m not Gabor, but I’m also one of the mentors on the project, so let me add my two cents here.

You can see all the open issues on GitHub. The issues that the community finds good for beginners are usually labelled as good first issue. Static analyzer related issues also have their own clang:static analyzer label.

I would suggest ignoring the tags though. Just pick whatever issue you find interesting and start working on it if that’s your ambition.

There are also other ways to contribute besides fixing bugs. For example, if you find a typo while reading through the documentation, you can submit a fix for it.

And how many contributions would be a good amount to demonstrate my involvement?

Having previously submitted patches is not a must, but it definitely helps. If you are interested specifically in this project, I suggest focusing your efforts on it and not splitting them across multiple unrelated things. I mean, don’t start working on, say, X86 backend issues, because this project is for the Clang Static Analyzer.

If you could suggest some specific open problems I can work on it would be so helpful.

A specific open problem to work on is how you would teach the CSA to understand lifetime annotations :smiley:. In all seriousness, just build the analyzer and start playing with it to understand how it works behind the scenes. Look into the checks, understand how they work, think about how you would add lifetime annotations to them, etc.

Hi,
I would like to contribute to this proposal. Currently I am reading the checker developer manual. Looking at the documentation left me a bit confused. From what I understand, lifetimebound allows me to specify which parameter of a function cannot be a temporary object or an rvalue. And lifetime_capture_by(x) allows the same, but also lets me express that x captures a reference to the argument, so that x is left with a dangling pointer once the argument is destroyed. Is my understanding correct? Can you point me to an example where lifetime_capture_by(x) is used?

Thanks for expressing interest!

Strictly speaking, this is not true. The argument can be a temporary; it just has to outlive the returned object.

For example:

string_view to_view(const string& str [[clang::lifetimebound]]);

cout << to_view(string("Hello"));

This example has a temporary argument, but it is perfectly fine since we never use the result after the temporary was destroyed. And the compiler will not emit a warning for this scenario.
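A minimal, compilable sketch of the same idea, contrasting a safe call site with a dangling one. The LIFETIMEBOUND macro and the functions safe_use and greet are hypothetical names introduced only for this illustration; the macro exists so that the snippet also builds with compilers that do not know the attribute. The dangling variant is left in a comment, and the warning mentioned there is what one would expect from Clang rather than quoted output.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper macro: expands to the attribute only under Clang.
#if defined(__clang__)
#define LIFETIMEBOUND [[clang::lifetimebound]]
#else
#define LIFETIMEBOUND
#endif

// The returned view points into `str`, so it must not outlive `str`.
std::string_view to_view(const std::string &str LIFETIMEBOUND) {
  return str;
}

// Fine: `owner` is a named object that outlives the view `sv`.
void safe_use() {
  std::string owner = "Hello";
  std::string_view sv = to_view(owner);
  assert(sv == "Hello");
}

// Also fine: the argument is a temporary, but the view is copied into a new
// std::string before the temporary dies at the end of the full-expression.
std::string greet() {
  return std::string(to_view(std::string("Hello")));
}

// Dangling (expected to be diagnosed by Clang's default warnings):
//   std::string_view sv = to_view(std::string("Hello"));
//   std::cout << sv; // the temporary is already destroyed here
```

The difference between greet and the commented-out lines is only where the result is used relative to the end of the full-expression, which is exactly what the annotation lets the compiler reason about.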

One good strategy to look for examples is to grep the test folder of Clang for the name of the annotation. Among those test files you will find plenty of code snippets that use these annotations.

Let me know if you still have questions after looking at those.

If you find a way to better explain these annotations feel free to open a PR to improve the documentation.

Thanks for correcting.

I looked through the examples for lifetime_capture_by(x) and from my understanding it allows us to bind a parameter’s lifetime to another parameter x. The annotated parameter must outlive any capturing entity.

However, I don’t understand the meaning of lifetime_capture_by(unknown) or lifetime_capture_by(global). Based on my understanding above, with the global argument the annotated parameter should have to live in the global scope. But that does not seem to be the case.

set<string_view> s;
void addToSet(string_view a [[clang::lifetime_capture_by(global)]]) {
  s.insert(a);
}

When I add this:

int main () {
  string h = "Hello, world";
  addToSet(string_view(h));
}

I get no warnings, even though h is local to main.

Can you clarify what lifetime_capture_by(global) does? The examples in the test directory didn’t really help here: they are just checks for parameter names, and the other example doesn’t help either.

The same goes for lifetime_capture_by(unknown); its examples are in the same area as those for global.

Yup! The analysis the compiler does for these lifetime annotations is really basic. It is a tradeoff so that it can be on by default during compilation. The Clang Static Analyzer, on the other hand, can do a deeper analysis, but it is typically slower than compilation.

The goal of this project is to have the Clang Static Analyzer catch errors that are not caught by the compiler. The example you showed is one of those that could potentially be targeted.

So, long story short, there is a gap between the information expressed by the annotations and the problems that are diagnosed by the compiler. The goal of this project is to narrow this gap when people are using the Clang Static Analyzer.
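For comparison, here is a sketch of the kind of call site the compiler's statement-local analysis is aimed at, using the parameter form lifetime_capture_by(s) instead of global. The CAPTURED_BY macro and the demo function are hypothetical names for this illustration; the macro falls back to nothing on compilers that lack the attribute. The comment about the warning describes the expected behavior and is not quoted compiler output.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>

// Hypothetical helper macro: use the attribute only where it is supported.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::lifetime_capture_by)
#define CAPTURED_BY(X) [[clang::lifetime_capture_by(X)]]
#endif
#endif
#ifndef CAPTURED_BY
#define CAPTURED_BY(X)
#endif

// `s` keeps a view of `a`, so whatever `a` refers to must outlive `s`.
void addToSet(std::string_view a CAPTURED_BY(s),
              std::set<std::string_view> &s) {
  s.insert(a);
}

std::size_t demo() {
  std::string local = "Hello, world";
  std::set<std::string_view> s;
  addToSet(local, s); // OK: `local` is declared before `s` and outlives it.

  // Expected to be diagnosed: the temporary std::string dies at the end of
  // the statement while `s` still holds a view of it.
  //   addToSet(std::string("temporary"), s);

  return s.size();
}
```

The problematic temporary is visible within a single statement here. In the earlier example with a global set and a local string, the dangling view only becomes observable after main's local goes out of scope, which is the kind of cross-statement reasoning the compiler warning does not attempt.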

I wanted some clarification regarding the extension of checkers. In the case of cplusplus.InnerPointer, are we also aiming to extend the checker to cover containers other than std::string, such as std::vector?

I also wanted an opinion on some modifications that I had in mind. In llvm-project/clang/lib/StaticAnalyzer/Checkers/InnerPointerChecker.cpp the method

void InnerPointerChecker::markPtrSymbolsReleased(const CallEvent &Call, ProgramStateRef State, const MemRegion *MR, CheckerContext &C)

could be annotated as follows

void InnerPointerChecker::markPtrSymbolsReleased(const CallEvent &Call [[clang::lifetime_capture_by(State)]], ProgramStateRef State, const MemRegion *MR, CheckerContext &C)

I made use of a dependency graph to decide on the annotations.

Is this a valid method to use? Are there any other methods that you can recommend?

Hello, I am also interested in this project!

As for “teaching the CSA”, where in particular can I go to see where CSA is already finding temporal memory errors?

Also, how do the annotations handle something like loops or returning a new/malloc’d variable from a function? I assume for cases like these we’d have to start doing something like dataflow analysis, or something more advanced.

Hello!

You can see the list of checkers the CSA has here.

The CSA is designed to find concrete execution paths that lead to bugs, and for that it uses symbolic execution. We probably don’t want to do dataflow analysis, or any other kind of technique, in this project.
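As a small illustration of what a "concrete execution path" means here, consider the hypothetical function below. The bug exists only on the path where releaseEarly is true; a path-sensitive analysis explores the two branches separately and can report the faulty one together with the assumption that led to it. The function name and its parameter are made up for this sketch, and the claim that an existing checker such as cplusplus.NewDelete reports it is an expectation, not verified output.

```cpp
#include <cassert>

// Hypothetical example: the read below is a use after free only on the path
// where `releaseEarly` is true. On the other path the read is valid and the
// memory is released afterwards.
int readValue(bool releaseEarly) {
  int *p = new int(42);
  if (releaseEarly)
    delete p;
  int value = *p; // invalid only when releaseEarly == true
  if (!releaseEarly)
    delete p;
  return value;
}
```

Calling readValue(false) is well defined and returns 42; only the releaseEarly == true path misbehaves, and that path is what a bug report would describe step by step.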

The reason why this check is limited to a predetermined list of types is that the information required to find these memory errors is not present in unannotated C++ code, so this knowledge is/was hardcoded. If/when the code is annotated, we should be able to extend this check to arbitrary types as long as they are properly annotated.

All that being said, it is possible that modeling this for an arbitrary type will be really different from modeling this for types like std::string that have known semantics.
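A sketch of what "properly annotated" could look like for a type the analyzer has no built-in knowledge of. Buffer, safe_use and the LIFETIMEBOUND macro are hypothetical names for this illustration. Writing the attribute after the function type applies it to the implicit this parameter. Note that this only states that the returned pointer is tied to the object's lifetime; it says nothing about member functions that invalidate the pointer while the object is still alive, which is part of what the hardcoded std::string model covers.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical helper macro: expands to the attribute only under Clang.
#if defined(__clang__)
#define LIFETIMEBOUND [[clang::lifetimebound]]
#else
#define LIFETIMEBOUND
#endif

// Hypothetical user-defined type, unknown to any checker.
class Buffer {
public:
  explicit Buffer(std::string text) : text_(std::move(text)) {}

  // The returned pointer is only valid while this Buffer is alive.
  const char *data() const LIFETIMEBOUND { return text_.c_str(); }

private:
  std::string text_;
};

// Fine: `b` outlives the inner pointer `p`.
bool safe_use() {
  Buffer b("abc");
  const char *p = b.data();
  return std::strcmp(p, "abc") == 0;
}

// Dangling: the temporary Buffer is destroyed at the end of the statement.
//   const char *p = Buffer("abc").data();
```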

I don’t think the State captures a reference to the CallEvent. The state consists of a program point (akin to an instruction counter at runtime), the bindings (mapping from memory locations to values), the constraints (relationships between the values) and custom checker state. (This is of course a bit of a simplification). CallEvent is a high-level facility to make querying the state a bit more convenient and provides some abstractions over all the different kinds of calls Clang has (like Obj-C methods, operators, free functions, methods etc).

Could you elaborate on how you made this graph and what is on it?

After reading your description I agree state does not capture a reference to the CallEvent. Though the converse might be true.

Since CallEvent makes querying the state easier, it stands to reason that the state should outlive the CallEvent.

I kept the graph free of any information regarding the semantics of the parameters and lifetimes. Bold text represents parameters that may benefit from a lifetime contract, the flow of time during execution runs from top to bottom, and italic text indicates the redefinition of a parameter. Everything else is a variable from the function.
The graph just shows which entities the variables and parameters depend on.

I have another question regarding annotations: we are only considering annotating the parameters, right? Not the variables inside the function?

Hi Everyone!

I find this project really intriguing. I am currently trying to find the circumstances in which the CSA does not give warnings about lifetimes.

For instance, I have tested the example in the documentation for the lifetimebound annotation. When I just ran clang normally, i.e. clang main.cpp, the warning was emitted as expected. I commented out the annotations and reran the previous command. This time no warning was emitted, which is not too surprising I guess. With the annotations still commented out, I invoked the static analyzer with all of the checkers enabled, and it did not give warnings. I found this more interesting, as I thought it would have been able to find the issue in that relatively simple case, even without annotations. I enabled the debug checker as well, just to make sure that it did not silently filter the warning out as a false positive. To view the debug output I used the exploded-graph-rewriter.py script.

At the moment I am still gathering examples and going through the provided materials. I will submit my findings in a formal proposal on the weekend.

- Gábor Tóthvári

Hi!

I think I have found an interesting example that illustrates an area of improvement for the analyzer’s annotation handling. The following program contains an obvious use-after-free bug; however, when I run clang++ --analyze main.cpp on this file, I get no such warning.

#include <string>
#include <vector>
#include <algorithm>
#include <random>

const std::string& pick_randomly(
    const std::string *choice0 [[clang::lifetimebound]], 
    const std::string *choice1 [[clang::lifetimebound]]
) {

    std::vector<const std::string*> v{};
    v.push_back(choice0);
    v.push_back(choice1);

    std::random_device rd;
    std::mt19937 g(rd());
    std::shuffle(v.begin(), v.end(), g);

    return *v[0];
}

int main() {

  const std::string *choice0 = new std::string{"foo"};
  const std::string *choice1 = new std::string{"bar"};

  const std::string &val3 = pick_randomly(choice0, choice1);

  delete choice0;
  delete choice1;

  const char *cstr = val3.c_str();  

  return 0;
}

The pick_randomly function takes two pointers to heap-allocated strings, shuffles them using a std::vector, then returns the one in the first position. At the moment, certain pieces of code, like the shuffle, can hide from the analyzer that the parameters might be returned from the function. Because of this, the use after free is not reported at the c_str() call.

If the vector shuffle is replaced with just a ternary operator and std::rand, the analyzer finds the problem.

const std::string& pick_randomly(
    const std::string *choice0 [[clang::lifetimebound]], 
    const std::string *choice1 [[clang::lifetimebound]]
) {
    return std::rand() % 2 ? *choice1 : *choice0;
}

My initial idea for a solution is that the lifetimebound annotation should initiate a path of analysis where it is assumed that the function’s return value refers to the same object as one of the parameters. For example, let’s say that it assumes for one path that val3 is essentially the same as choice0. Then, once choice0 is deleted, the analyzer would know that val3 also became invalid, and therefore val3.c_str() is a use after free. Naturally, when there are multiple parameters, it could do this for each parameter.
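The idea can be sketched as a toy model in plain C++. This is not analyzer code: LifetimeModel, bindResult, release, mayDangle and replayExample are hypothetical names that do not correspond to any Clang Static Analyzer class or API, and objects are identified by name only. As a simplification, the model keeps a single "may refer to" set per result instead of forking one path per annotated parameter.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model only: names are hypothetical and unrelated to real CSA classes.
class LifetimeModel {
public:
  // A call to a function with lifetimebound parameters was seen: the result
  // may refer to any of the annotated arguments.
  void bindResult(const std::string &result,
                  const std::set<std::string> &boundArgs) {
    mayReferTo_[result] = boundArgs;
  }

  // Models `delete object;`.
  void release(const std::string &object) { released_.insert(object); }

  // Would a use of `name` touch memory that may already be released?
  bool mayDangle(const std::string &name) const {
    if (released_.count(name))
      return true;
    auto it = mayReferTo_.find(name);
    if (it == mayReferTo_.end())
      return false;
    for (const std::string &arg : it->second)
      if (released_.count(arg))
        return true;
    return false;
  }

private:
  std::map<std::string, std::set<std::string>> mayReferTo_;
  std::set<std::string> released_;
};

// Replays the statements of main() from the example against the toy model.
bool replayExample() {
  LifetimeModel model;
  model.bindResult("val3", {"choice0", "choice1"}); // pick_randomly(...)
  model.release("choice0");                         // delete choice0;
  model.release("choice1");                         // delete choice1;
  return model.mayDangle("val3");                   // val3.c_str()
}
```

In this model the use of val3 is flagged regardless of what pick_randomly does internally, because the decision is driven by the annotation on the declaration rather than by the function body.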

I am more than interested to hear your feedback about this example.

- Gábor Tóthvári