[RFC] Summary Based Analysis Prototype

isuckatcs · April 20, 2025, 4:35pm

Motivation

After the round tables at EuroLLVM 2025, we had a chat with @Xazax-hun about implementing a prototype of the summary-based approach and seeing how it affects the quality of the bug reports, starting with the Clang Static Analyzer.

Implementation Details

I imagine an implementation similar to what we see with the compilation database. Instead of the build system emitting a compile_commands.json file though, we add a new flag to the compiler, e.g.: -emit-summary, and let the compiler create a <source-file>-summary.json file.

The various clang tools could then consume this similar to how the compile_commands.json is consumed. Except, instead of accepting only one summary, the tools could take multiple ones, that will get merged upon processing.

The Map Phase

During the round tables, a MapReduce like infrastructure was proposed. At first glance, I would imagine the Map part to happen inside Sema as a FunctionDecl is processed.

We would have a class, e.g.: Summarizer, that runs a series of analysis passes on the function after it has been resolved by Sema and its body is known. The result is immediately appended to the <source-file>-summary.json file. This way, by the time clang has finished processing the AST, we will also have the summary if it was requested.

The Summary

JSON seems to be a convenient format for the summaries, as we can already work with it in the LLVM project. IIUC, we need to store at least 3 things about a function: its unique identifier, a set of attributes we could infer, and the other functions it calls.

Take the following snippet as an example, and assume we want to figure out if the function writes to global variables or not.

void a() {
  int x = 0;
  b();
}

The function a() doesn’t write to global variables if it doesn’t contain a direct write to a global variable and b() doesn’t write to global variables either. On the other hand, b() might be defined in a separate translation unit, so we only perform intra-procedural analysis, i.e. we check for direct writes only. As a result, we end up with the following summary.

[
  {
    "id": "a",
    "attrs": ["NO_WRITE_GLOBAL"],
    "deps": ["b"]
  }
]

The b() function might look like this:

void b() {
  global = 1;
}

In this case, we end up with this summary:

[
  {
    "id": "b",
    "attrs": [],
    "deps": []
  }
]

As for identifying the functions, using their names is not an ideal option, as there can be multiple functions with the same name (overloads, static functions local to source files, etc.), so we have a couple of options here. USR, MangledName, ODRHash could all be viable options, though USR seems to be the most convenient, as it has a way to deal with static functions in different translation units with the same signature.

The Reduce Phase

Reducing the summaries can either happen in a different class, or we could reuse the Summarizer, so that it can both summarize a function and add attributes to it based on existing summaries.

In either case, we would parse all the summaries given to the various clang tools, and for each function, to figure out its actual attributes, we could run the following algorithm:

attrs calculateAttrs(FunctionSummary F) {
  attributes = F.attrs;
  
  for(D in F.deps)
    attributes = intersect(attributes, calculateAttrs(getSummary(D)));

  return attributes;
}

In other words, we probably only want to keep those attributes that also apply to every function the visited function calls. We would also use some form of caching, so that the same function is not evaluated multiple times. In case of cycles, we can probably just return before the for loop.
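A minimal C++ sketch of this reduction, under a few assumptions that are not in the post: the summaries live in a std::map keyed by function id, a callee without a summary contributes an empty attribute set, and the recursion is swapped for repeated sweeps over all functions until nothing changes, so that caching and cycles need no special handling. The names FunctionSummary, SummaryMap and reduceAttrs are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using AttrSet = std::set<std::string>;

// Hypothetical in-memory form of one entry of a summary file.
struct FunctionSummary {
  AttrSet attrs;                 // attributes inferred from the body alone
  std::vector<std::string> deps; // ids of the functions it calls
};

using SummaryMap = std::map<std::string, FunctionSummary>;

// Keeps, for every function, only the attributes that also hold for
// everything it calls, directly or transitively.
std::map<std::string, AttrSet> reduceAttrs(const SummaryMap &summaries) {
  std::map<std::string, AttrSet> result;
  for (const auto &entry : summaries)
    result[entry.first] = entry.second.attrs;

  bool changed = true;
  while (changed) { // the sets only shrink, so this terminates
    changed = false;
    for (const auto &entry : summaries) {
      AttrSet &mine = result[entry.first];
      for (const std::string &dep : entry.second.deps) {
        AttrSet kept;
        auto callee = result.find(dep);
        // Assumption: a callee without a summary has no known attributes.
        if (callee != result.end())
          std::set_intersection(mine.begin(), mine.end(),
                                callee->second.begin(), callee->second.end(),
                                std::inserter(kept, kept.begin()));
        if (kept.size() != mine.size()) {
          mine = std::move(kept);
          changed = true;
        }
      }
    }
  }
  return result;
}
```

With the two summaries from the example above, a() loses NO_WRITE_GLOBAL because b() does not have it, while two functions that only call each other keep whatever attributes they share.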

Depending on how expensive this is, the algorithm could also be run while Sema processes the source file and sees a function call, so that the emitted warning can also use the summary that exists so far. This might be difficult though, as in C++ a function might be defined only after it is called, and in those cases we don’t know anything about its body as we haven’t seen it yet.

void foo();

void bar() {
  foo(); // when we see this, we couldn't have analyzed the body of 'foo()'
}

void foo() {
}

As for using these attributes, I see two approaches. The first is to add them directly to a FunctionDecl in the AST, so they are available automatically everywhere, but this might require modifying the AST. The other approach is to store them in a different object, so the AST is not modified, but with this approach every call site that wants to read these attributes needs to be modified.

Implementation Plan

With @Xazax-hun we discussed starting small and only implementing reasoning about whether a function writes to globals, because it is a relatively simple analysis, and the CSA could use this information directly for conservative evaluation.

This also requires implementing the generation and the consumption of the summaries. Once we have a working prototype and see what blockers we have, how expensive this approach is, etc., we can think about the additional steps, such as build system integration.

I wonder if you have any thoughts on this, see some cases we’re missing here, or know better, more efficient approaches. Any feedback is appreciated.

cc @Xazax-hun @steakhal @devincoughlin @ilya-biryukov

Xazax-hun · April 20, 2025, 5:48pm

This is a good first step! I have some questions.

Where would you put this summary json file? Next to the source file? What if that source file is compiled multiple times with different command line options? Next to the object file? What if the compiler invocation does not produce an object file? What should be the behavior when multiple source files are passed to the same compiler invocation? Should the user be able to specify the path?

What would be the model for templates? Do we analyse the instantiations? Do we plan to have a different summary for each instantiation or do we attempt to do some deduplication if they happen to be the same? Do we want to collect summaries for all entities? Even the ones in system headers? Do we want users to be able to inject summaries manually? Where do we want to merge them?

It is convenient such that summaries can be easily inspected, diffed and so on. But it might not be the most efficient. I wonder if it scales well to large projects, or if we should use a binary format. I am OK starting with a textual representation and switching over to binary later if that turns out to be desirable.

I think we might need a bit more information. Sometimes, we might not be able to resolve all the function calls because they are through virtual functions or function pointers. I think that information is crucial in the summaries.

I think this is a good start to build out the infrastructure but we want to support more complex attributes in the future. Specifically, inferring noescape would be tremendously useful. But noescape is not a function attribute, but a parameter attribute. So we might want to be able to include parameter and maybe return value specific information as well. A simple array of attributes might not cut it.

I think the reduce phase might be a bit more complex than this. The call graph might have unknown dependencies (calls without summaries, calls through function pointers), and it might have multiple cycles (although a solution was suggested for cycles).

This is fine, if we already have summaries available for some reason we do have some information.

Some of the summaries we want to infer already exist as attributes in the language. In those cases, I think it would be really beneficial to add them to the AST.

Some other general observations:

We might want to extend this along many dimensions in the future. E.g., does it make sense to have type summaries? Maybe. So at the very least we need to version the summary format.
For maximum precision, different attributes might have different dependencies. E.g., writes global can depend on all of the called functions. But noescape would only depend on the function calls that take the parameter for which we do the inference as an argument. Of course, over-approximating might be OK in the first version.

The kind of information we might want to infer includes the bounds safety annotations and lifetimebound in the future. I wonder if @jankorous or @rapidsna have any requirements for those.

isuckatcs · April 20, 2025, 7:20pm

Perhaps we could require a path to a directory in which the summaries will be put, so the option becomes --emit-summary-to=/path/to/summary-dir.

The internal structure of the given dir could match the structure of the project, so if we compile project-root/lib/foo.cpp, its summary will end up being at summary-dir/lib/foo-summary.json.
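As a rough illustration of that mapping, here is a small C++17 sketch built on std::filesystem. The helper name summaryPathFor and its parameters are hypothetical, and it assumes the source file lies below the project root.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: mirrors the position of `source` below `projectRoot`
// into `summaryDir` and swaps the file name for "<stem>-summary.json".
fs::path summaryPathFor(const fs::path &projectRoot, const fs::path &source,
                        const fs::path &summaryDir) {
  // Purely lexical, so nothing has to exist on disk.
  fs::path relative = source.lexically_relative(projectRoot);
  fs::path fileName = relative.stem();
  fileName += "-summary.json";
  return summaryDir / relative.parent_path() / fileName;
}
```

For project-root/lib/foo.cpp and a summary directory named summary-dir this yields summary-dir/lib/foo-summary.json, matching the layout described above.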

We could also mimic what --save-temps does, so the summary is emitted in the output directory specified by the -o parameter. If -o is missing the summary goes next to the source file.

I was thinking about analyzing the instantiations. The idea is that different template instances have different labels, so, when called, each instance has a different unique identifier.

We can implement deduplication, though I think that’s way beyond the prototyping phase and we should only cross this bridge once we get there. First we should see something working and then, once we have a better understanding of it, we can start to think about how to optimize it. In case of deduplication and JSON though, the "id" field would probably no longer be just one id, but an array of identifiers. However, to do this, we need to have the full summary first, and then run an additional pass on it that finds and collapses the duplicates.
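Such a collapsing pass could look roughly like the C++ sketch below. It assumes that two entries are duplicates when their attributes and their dependencies are identical; the Entry type and collapseDuplicates are made-up names for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry whose "id" is already a list of identifiers.
struct Entry {
  std::vector<std::string> ids; // one id before the pass, possibly many after
  std::set<std::string> attrs;
  std::set<std::string> deps;
};

// Collapses entries whose attributes and dependencies are identical.
std::vector<Entry> collapseDuplicates(const std::vector<Entry> &entries) {
  using Key = std::pair<std::set<std::string>, std::set<std::string>>;
  std::map<Key, std::vector<std::string>> groups;
  for (const Entry &e : entries) {
    std::vector<std::string> &ids = groups[Key(e.attrs, e.deps)];
    ids.insert(ids.end(), e.ids.begin(), e.ids.end());
  }

  std::vector<Entry> collapsed;
  for (const auto &group : groups)
    collapsed.push_back(
        Entry{group.second, group.first.first, group.first.second});
  return collapsed;
}
```

Two template instantiations with the same attributes and callees would end up as one entry with two ids, while everything else stays as it was.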

We can probably ignore system headers and manually assume a certain set of attributes for them to save space and time. If the analyses we perform are relatively quick, we can let them run though. The only real concern is the size of the summary then.

I don’t have a strong opinion on whether we want users to inject anything manually or not. In case of a compilation database file, anyone can edit it manually, so they can add or remove flags. I found myself doing that once or twice actually. In case of summaries, users might edit them at their own risk.

Merging the summaries happens during the “Reduce” phase.

Sure, first let’s get it working, then think about optimizing it. We also need to figure out if we want users to manually edit it, as in case of a binary format that will be very difficult. Like I said, I’m totally neutral in this and not leaning towards either side.

A format that’s simple to generate might also come in handy for interoperability. For example, we might call a function written in a different language, and the summary of that function would be generated by a different compiler.

Hmm, this information is only usable by the CSA if it knows the type of the object, or where the pointer points to, right? Clang-Tidy for example couldn’t take advantage of it, because it doesn’t know what address the pointers hold, right?

For the prototype, I would just remove every attribute from the function where we see these and say we cannot reason about it. In the future, we can figure out what to do with these. ATM, I’m not even sure how we can reason about them properly. Maybe we could intersect the attributes of every function the pointers can possibly point to, but even then it’s very likely we end up with an empty set.

JSON is flexible with this. The "attrs" array is for the function itself. Later we could add a "param_attrs" and a "return_attrs" field as well, although I think the return value specific information should still be a function attribute. If the return specific information belongs to the return type, it should be a type attribute instead.

Calls without summaries or calls through function pointers are something we can’t reason about, so in that case we just intersect with an empty attribute set. Calls through function pointers could be solved in the future though, as mentioned above.

What I’m not sure about is that initially a fixpoint algorithm was suggested and I’m not seeing why we need that, so I’m probably missing something.

This is fine. Whatever format we end up using, JSON, YAML, custom binary format, we can just add a version field to the beginning.

We could actually have the summaries grouped by attributes, so each attribute could have its own list of dependencies. They are produced by different analysis passes after all.

[
  {
    "id": "foo",
    "attrs": [
      {
         "id": "NO_WRITE_GLOBAL",
         "deps": ["bar", "baz"]
      },
      {
         "id": "NOESCAPE",
         "deps": ["bar"]
      }
    ]
  }
]

Xazax-hun · April 20, 2025, 8:58pm

This does not solve the issue when a source file is compiled multiple times, this would be a bit racy. We could include a compiler command hash in the json name to avoid races. At the consumption side, we could just drop summaries that are not identical between the different compiler invocations for the same function.
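One way to picture the hashed file name is the C++ sketch below. The naming scheme "<stem>-<hash>-summary.json" and the helper names are hypothetical; FNV-1a is used only because it is tiny and platform-independent, not because anything in this thread prescribes it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// 64-bit FNV-1a; any hash that is stable across runs and machines would do.
std::uint64_t fnv1a(const std::string &text) {
  std::uint64_t hash = 0xcbf29ce484222325ULL;
  for (unsigned char c : text) {
    hash ^= c;
    hash *= 0x100000001b3ULL;
  }
  return hash;
}

// Hypothetical naming scheme: "<stem>-<hash of the command line>-summary.json".
std::string summaryFileName(const std::string &stem,
                            const std::vector<std::string> &args) {
  std::string joined;
  for (const std::string &arg : args) {
    joined += arg;
    joined += '\0'; // separator, so {"-D", "A"} and {"-DA"} hash differently
  }
  char hex[17];
  std::snprintf(hex, sizeof hex, "%016llx",
                static_cast<unsigned long long>(fnv1a(joined)));
  return stem + "-" + hex + "-summary.json";
}
```

Two invocations that differ in any flag then write to different files, and the consumer can group files by stem to compare the summaries of the same function.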

Sounds good for the prototype. That being said, dropping summaries for all functions that have a virtual call might be a bit aggressive. I think we might want to do something more sophisticated about this in the future. But I agree, this is not something we need to address initially.

In case we anticipate different kinds of attributes to be stored under different keys, I think we should call it "function_attrs".

I think this depends on the problem. Imagine, we want to infer lifetimebound:

int *f(int *a [[clang::lifetimebound]], int *b, int *c, int *d) {
  if (rand() % 2) return a;

  return f(b, c, d, a);
}

Here, the summary might have the information that the first argument is lifetimebound. But because of the call f(b, c, d, a) we could infer that the second parameter is also lifetimebound. However, because of the f(b, c, d, a) call, now we know that the third parameter is lifetimebound. And so on, until we can deduce that all of them are lifetimebound. The point is, once we discover that a function parameter is lifetimebound, we need to go back and reprocess all the summaries that depend on this function, because we can now potentially infer additional cases of lifetimebound parameters. In a cycle (this example was a cycle of one), we might need to traverse the cycle multiple times to propagate all the information.

Xazax-hun · April 20, 2025, 9:41pm

isuckatcs:

[
  {
    "id": "foo",
    "attrs": [
      {
         "id": "NO_WRITE_GLOBAL",
         "deps": ["bar", "baz"]
      },
      {
         "id": "NOESCAPE",
         "deps": ["bar"]
      }
    ]
  }
]

Actually, the more I think about it the more I wonder if we should actually not have any “dependencies” between the different attributes. Maybe what we actually want is more of a semantic description of a function body. Something like:

[
  {
    "id": "foo",
    "called": [
        { "id": "bar", "flows": { "1": 1, "2": 1 } } // How params are propagated to calls
    ],
    "params": { "1": ["noescape", "lifetimebound"] },
    "side-effects": ["no-global-write", "no-alloc"]
  }
]

For:

int *f(int *p [[clang::lifetimebound]] [[clang::noescape]]) {
  return bar(p, p);
}

isuckatcs · April 21, 2025, 12:25am

I probably don’t understand the exact use-case that we want to solve here. Do you mean that during building a large project (e.g.: ninja clang), the same source file can be compiled multiple times with different compiler options? Or do you mean that we might build the project, add some new compiler flags and then rebuild it?

Xazax-hun:

I think this depends on the problem. Imagine, we want to infer lifetimebound:
int *f(int *a [[clang::lifetimebound]], int *b, int *c, int *d) {
  if (rand() % 2) return a;

  return f(b, c, d, a);
}
Here, the summary might have the information that the first argument is lifetimebound. But because of the call f(b, c, d, a) we could infer that the second parameter is also lifetimebound. However, because of the f(b, c, d, a) call, now we know that the third parameter is lifetimebound. And so on, until we can deduce that all of them are lifetimebound. The point is, once we discover that a function parameter is lifetimebound, we need to go back and reprocess all the summaries that depend on this function, because we can now potentially infer additional cases of lifetimebound parameters. In a cycle (this example was a cycle of one), we might need to traverse the cycle multiple times to propagate all the information.

This is not the merging of the summaries though, right? This should be done by a “mapper” pass, and the summary it emits should contain the information that every parameter of f() is lifetimebound.

In fact, this information could also be processed by Sema or a lifetimebound-annotator CT check to warn that the parameter is actually lifetimebound, but not marked as such. We could probably even do this now, as it doesn’t require summaries unless cross-TU function calls are involved.

As for merging this property across translation units, I have no idea how we can do that. Imagine the following case:

// A.cpp
int *b(int *);

int *a(int *x) {
  return b(x);
}

// B.cpp
int *b(int *y) {
  return y;
}

So, y could actually be annotated as lifetimebound, right? That would make x also lifetimebound, right?

Since these are in different source files, they were compiled with separate compiler invocations, so we have two different summary files.

// A-summary.json
[
  {
    "id": "a",
    "called": ["b"],
    "params": [[]] // The index of the element in the array corresponds to the index of the parameter.
  }
]

// B-summary.json
[
  {
    "id": "b",
    "called": [],
    "params": [["lifetimebound"]] // The index of the element in the array corresponds to the index of the parameter.
  }
]

To figure out that x is also lifetimebound, we would need to analyze a() again such that B-summary.json is known to the analyzer. During the merging phase we only know the properties, we don’t know anything about the function bodies, and don’t have the ASTs either. The summary might also contain functions from other languages we interop with, and in those cases we can’t even build an AST.

In fact, in this particular case, we would need to do some cross-TU fixpoint analysis, IIUC. Like, we emit a summary for each TU, recompile each TU with the now known summaries, and repeat until the summaries stop changing.

Xazax-hun:

Actually, the more I think about it the more I wonder if we should actually not have any “dependencies” between the different attributes. Maybe what we actually want is more of a semantic description of a function body. Something like:
[
  {
    "id": "foo",
    "called": [
        { "id": "bar", "flows": { "1": 1, "2": 1 } } // How params are propagated to calls
    ],
    "params": { "1": ["noescape", "lifetimebound"] },
    "side-effects": ["no-global-write", "no-alloc"]
  }
]
For:
int *f(int *p [[clang::lifetimebound]] [[clang::noescape]]) {
  return bar(p, p);
}

I think this will get very complicated quickly with more complex functions.

int * complex(int *p [[clang::lifetimebound]]) {
  p = foo(p);
  return bar(p, p);
}

Here we can’t write anything to "flow", because we don’t know what p is after the assignment, right?

int * complex(int *p [[clang::lifetimebound]]) {
  if(rand() % 2) 
    return bar(nullptr, p);

  return bar(p, nullptr);
}

This other function however will probably produce two different propagations. Or we just want to treat it as { "id": "bar", "flows": { 1: 1, 2: 1 } }? Isn’t this the same as bar(p, p)? Can we actually treat them the same way?

At this point, I don’t think annotating the parameters while merging the summaries is possible, at least not without making it very expensive and overcomplicated. If interoperability is involved, it’s even more complicated. We need to reach a fixpoint across multiple translation units, not just one, and apparently these might be written in different languages.

IIUC, in Rust something similar only works because marking borrows is compulsory. If something is not marked as borrowed, ownership transfer is assumed, so technically the compiler doesn’t INFER lifetime annotations, but already KNOWS them, and ensures that what the user writes is correct according to them.

Xazax-hun · April 21, 2025, 9:06am

Yes.

Since here the cycle is of length 1, we technically could. But it is easy to split this up in two or more functions in different TUs, so we no longer can.

int *g(int *a, int *b, int *c, int *d);
int *f(int *a [[clang::lifetimebound]], int *b, int *c, int *d) {
  if (rand() % 2) return a;

  return g(b, c, d, a);
}

int *g(int *a, int *b, int *c, int *d) {
  return f(a, b, c, d);
}

That is the exact motivation why we need summaries. Inferring some of these annotations is inherently a whole program analysis problem and we somehow need to handle the cross-TU calls.

Not necessarily. This all depends on what we put in the summaries. If we want to be able to infer lifetimebound annotations across function calls, we need to persist the information we need for this in the summaries. This is something the "flows" part would permit. The idea is, in the Map phase the analysis does have access to the AST, so it can persist all the info we would need later on in the reduce phase, in your example that would be the fact that the first argument of a flows into the first argument of b. This fact combined with the fact that b’s first argument is lifetimebound will provide all the info we need to mark the first parameter of a lifetimebound.
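To make the role of "flows" concrete, here is a C++ sketch of a reduce step that infers lifetimebound purely from persisted facts, without any AST. Everything in it is an assumption made for illustration: the types ReturnedCall and FunctionFacts, the encoding of flows as "argument index to caller parameter index, or -1 when unknown", and the rule that a parameter which is returned directly counts as lifetimebound.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// A call whose result the caller returns.
struct ReturnedCall {
  std::string callee;
  std::vector<int> flows; // flows[i]: caller parameter passed as argument i, -1 if unknown
};

// Facts persisted by the Map phase for one function (hypothetical layout).
struct FunctionFacts {
  std::set<unsigned> returnedParams;       // parameters returned directly
  std::vector<ReturnedCall> returnedCalls; // calls whose result is returned
};

using Facts = std::map<std::string, FunctionFacts>;
using Bound = std::map<std::string, std::set<unsigned>>;

// Propagates lifetimebound from callee parameters back to caller parameters
// until no new parameter can be marked.
Bound inferLifetimebound(const Facts &facts) {
  Bound bound;
  for (const auto &entry : facts)
    bound[entry.first] = entry.second.returnedParams;

  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &entry : facts) {
      for (const ReturnedCall &call : entry.second.returnedCalls) {
        auto callee = bound.find(call.callee);
        if (callee == bound.end())
          continue; // no summary for the callee: infer nothing
        // Copy, because the callee may be the caller itself.
        const std::set<unsigned> calleeBound = callee->second;
        for (unsigned arg : calleeBound) {
          if (arg >= call.flows.size() || call.flows[arg] < 0)
            continue;
          unsigned param = static_cast<unsigned>(call.flows[arg]);
          if (bound[entry.first].insert(param).second)
            changed = true;
        }
      }
    }
  }
  return bound;
}
```

For the a()/b() pair this marks the first parameter of a(), and for the f()/g() cycle it needs several sweeps before all four parameters of both functions are marked, which is the repeated traversal described earlier.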

That is really expensive and that is something we hopefully can avoid if we have the right summaries.

That is fine. As soon as we cannot reason about a value we just do not persist that info into the summaries and we lose some coverage. But we do not infer the wrong facts.

Exactly, summaries will always lose some information. For the purposes of lifetimebound, I think it does not matter if we have bar(p, p), or bar(p, nullptr) and bar(nullptr, p). The idea is to define the summaries in a way that it abstracts away the details we do not care about and only capture the important bits. This is the key to make the analysis more efficient.

isuckatcs · April 21, 2025, 11:59am

So, I think the key takeaway here is that the different analysis passes need to emit different information in the summaries, and need a different logic to merge them. For example writing to globals doesn’t require a fixpoint, but annotation inference does.

We also want to be able to add new analysis passes efficiently, and remove or disable them as well, so we need a flexible summary format that can handle this.

I imagine that each analysis can also serialize and deserialize its result, so the logic is completely separated. This is similar to how MLIR works, though we probably don’t want to go down this rabbit hole for now.

class SummaryBasedAnalysis {
public:
  virtual ~SummaryBasedAnalysis() = default;
  virtual void run() = 0;
  virtual void summarise() = 0;
  virtual void parse() = 0;
  virtual bool reduce() = 0; // true - summary changed, false - summary didn't change
};

Processing them would then look something like the code below. I’m not sure about the loop order; we could also arrange it so that the summaries are only iterated once and the passes are iterated multiple times.

void processSummaries() {
  for(analysis : analyses) {
    for(summary : summaries) {
      analysis.parse(summary);
    }

    while(analysis.reduce());
  }
}

Basically, each analysis reads all of its information and then merges it. This allows us to add, remove or modify individual analysis passes separately without affecting anything else.

With a semantic description instead, we wouldn’t be able to edit the different passes separately, because if one of them modifies the layout of the semantic description, other passes will be affected as well.

Also, this way the annotation inference becomes a “How to design my own analysis such that it infers annotations?” question instead of a “How we should modify the framework such that we can also infer annotations?” question. In other words, annotation inference only relies on itself and not on how we design the framework.

isuckatcs · April 21, 2025, 12:08pm

Are you sure that in this case the resulting object files are also placed next to each other? In that case they would overwrite each other, so the linker might link wrong objects, right?

Also if we have separate files because of the different compiler options, I don’t think we need to drop them completely. We can still merge them, and we preserve the information that was present with both invocations.

Xazax-hun · April 21, 2025, 1:03pm

That is a big tradeoff. This would mean that the number of iterations through the summaries would scale linearly with the number of analyses. It would be nice if we could update all analyses with the same traversal.

The summarised semantic information should not change over time. If we summarised that a function parameter’s value flows into a call, that is a fact about the body of the function. This fact should never change. It was true when we did the summarisation and will remain true throughout the analysis.

Also, it would be really bad if we needed to duplicate the same information (like dataflow) across all the different analyses.

I thought the proposal was to match the source directory layout, but maybe I missed that part. Yeah, it is unlikely that the object files would end up in the same output directory in that scenario (but it is not impossible to set the build up in a way that they end up in the same directory under different names).

This is what I meant. Drop the information where the two differ.

isuckatcs · April 21, 2025, 3:01pm

Here I meant that we might decide that we need a new property later, so we might end up adding a new field to the summary. If the passes are parsing the summary, they will need to be changed.

We could then fix a set of rules about how certain properties are merged together instead of letting the analyses decide themselves.

void reduce(Summary s) {
  for(f : s.functions) {
    for(c : f.called)
      f.sideEffects.intersectWith(s.find(c).sideEffects);

    for(p : f.params)
       // Rules for different parameters...
  }
}

The problem is that different goals with parameter annotations require different handling. If we pass a noescape parameter to a different function, where the parameter of the function that receives it is not marked as noescape, the parameter in the caller cannot be marked as noescape either, so we want to intersect here.

If we want to infer the attributes however, based on the lifetimebound example, we will say that parameter p is passed to a function as a lifetimebound argument, so p should also be lifetimebound in the caller (though only if the value returned by that function is also returned by the caller), so in this case we want a union.

I still feel like merging the summaries and inferring annotations are 2 separate problems. Maybe we could reduce the summaries first, when we intersect everything, and then post-process the reduced summary, so we can figure out the annotations.

Also, how about separating the serialization/deserialization from the analyses? This would leave us with the following architecture.

class Summary {
public:
  void serialize(const char *);
  void parse(const char *);
  void merge(Summary);
  void reduce();
};

class SummaryBasedAnalysis {
public:
  virtual void run(Summary &) = 0;
};

The logic we perform would look like this:

void performSummaryBasedAnalysis() {
  Summary summary;
  
  for(analysis : analyses)
    analysis.run(summary);
  
  summary.serialize("path/to/summary.json");
}

void readSummaries() {
  Summary summary;
  
  // This is just merging like concatenating the content of the JSONs together.
  for(path : pathsToJsons) {
    Summary partial;
    partial.parse(path); // parse() returns void, so it can't be chained into merge()
    summary.merge(partial);
  }

  summary.reduce();

  // Give the reduced summary to Sema, so it can use additional information
  // while building the AST.
}

This way we also get separate control over the summary format and the analyses. We can change the internals, representation, etc. of the summary as long as the API remains the same. We can also change how each analysis pass works without affecting anything else.

mshockwave · April 21, 2025, 4:37pm

isuckatcs:

attrs calculateAttrs(FunctionSummary F) {
  attributes = F.attrs;
  
  for(D in F.deps)
    attributes = intersect(attributes, calculateAttrs(getSummary(D)));

  return attributes;
}
In other words, we probably only want to keep those attributes that also apply to every function the visited function calls.

I might misunderstand what you meant here, but how would we guarantee a fixed point if the join function is a set intersect? Intuitively, it’ll be easier to guarantee a fixed point if it’s a set union.

isuckatcs · April 21, 2025, 5:09pm

Sure, this is why I said previously that I don’t quite understand why we need a fixpoint in this case. Then @Xazax-hun showed an example of a fixpoint algorithm for inferring lifetimebound.

In case of functions, we say that a function only has a certain property, if every other function it calls also has that property. We definitely want to intersect here and this is not a fixpoint algorithm.

The fixpoint algorithm we’ve been talking about was mentioned here.

Xazax-hun:

I think this depends on the problem. Imagine, we want to infer lifetimebound:
int *f(int *a [[clang::lifetimebound]], int *b, int *c, int *d) {
  if (rand() % 2) return a;

  return f(b, c, d, a);
}
Here, the summary might have the information that the first argument is lifetimebound. But because of the call f(b, c, d, a) we could infer that the second parameter is also lifetimebound. However, because of the f(b, c, d, a) call, now we know that the third parameter is lifetimebound. And so on, until we can deduce that all of them are lifetimebound. The point is, once we discover that a function parameter is lifetimebound, we need to go back and reprocess all the summaries that depend on this function, because we can now potentially infer additional cases of lifetimebound parameters. In a cycle (this example was a cycle of one), we might need to traverse the cycle multiple times to propagate all the information.

Now normally, we could just say that we have X e.g.: JSON summaries, concatenate them into only 1 JSON and try to answer some question based on it.

1. Does the function we marked as no-write-global really not write to globals? Only if every function it calls is also marked as no-write-global. This is an intersection, not a fixpoint.
2. A parameter of a function is marked as noescape; is that correct? Only if it is not passed to other functions, or the functions it is passed to as an argument also have that parameter marked as noescape. This is also an intersection, and not a fixpoint.
3. This parameter has no attributes, but could we mark it as lifetimebound? Well, it is passed to a function as a lifetimebound argument, and we also return what that other function returned to us, so the parameter could be marked lifetimebound. Now that we know that this parameter is lifetimebound, can we mark any other parameters as lifetimebound too? This is a fixpoint and requires union.

Ideally all of these questions would have a separate pass that tries to answer them, but in that case the summary of the program is processed by each pass.

The question is, whether there is a way to answer all of these questions by only processing the summary of the program once. Seemingly what we have to do with 2. and 3. contradict each other, so probably one processing is not enough.

steakhal · April 23, 2025, 2:08pm

Let me write down, asynchronously, a couple of properties I’d be interested in computing, among others:

deducing noreturn functions, across TUs:
If function “A” must call “B” and “C”, then “A” is noreturn if any of the functions it can’t avoid calling is noreturn. This would help identify custom assertion functions, which are fairly common in testing libraries such as gtest, and would reduce FPs.
deducing what files should be closed:
We would need to know what parameters a function “may” close, and any output parameters/ retval that “must” give an opened handle. (To make sure the stream checker can transition the states correctly).
detecting uninitialized variables:
We would need to know what parameters a function “must” be able to read, what it “may” initialize.
detecting file handle leaks:
We would need to know if a function “may” close a handle parameter, and what handles it “must” return or set (if it opens files inside).

Consequently, many CSA checkers need actually some “must” and “may” properties at the same time to implement state transitions correctly.
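As a rough illustration of why a checker wants both kinds of property at once, here is a sketch of a stream-checker-like state transition driven by a summary. Everything in it (HandleSummary, MayClose, MustReturnOpened, afterCall, returnedHandle) is a hypothetical name chosen for this example, not an existing CSA API.

```cpp
#include <cstddef>
#include <vector>

enum class HandleState { Opened, Closed, Unknown };

// Hypothetical per-function summary for handle tracking.
struct HandleSummary {
  std::vector<bool> MayClose;    // per parameter: the callee "may" close it
  bool MustReturnOpened = false; // the return value "must" be an opened handle
};

// State of a handle after it was passed as argument ArgIndex to the callee.
HandleState afterCall(HandleState Before, const HandleSummary &Callee,
                      std::size_t ArgIndex) {
  // No information about this parameter: be conservative and assume "may".
  bool May = ArgIndex < Callee.MayClose.size() ? Callee.MayClose[ArgIndex]
                                               : true;
  // A "may" close makes an opened handle untrackable, so we neither report
  // a leak nor a double close for it later on.
  if (Before == HandleState::Opened && May)
    return HandleState::Unknown;
  return Before;
}

// State of a handle that was returned by the callee.
HandleState returnedHandle(const HandleSummary &Callee) {
  // Only a "must" property lets the checker start tracking the handle.
  return Callee.MustReturnOpened ? HandleState::Opened : HandleState::Unknown;
}
```

The "may" property is used to stop tracking, the "must" property is used to start tracking, which is why one of them alone is not enough for the transitions.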

One of my colleagues enumerated some checkers (40 or so) with this mindset of what “may” and “must” properties each checker would benefit from. I’ll ask if we could share some of the results.

isuckatcs · April 23, 2025, 6:48pm

I was also thinking a bit more about what we should store in the summaries. I think we should only store attributes that we can take full responsibility for. These are the attributes we can infer during the “map” phase. Any user provided attribute is discarded.

Let’s say, the user writes this function in their code.

int *f(int *a [[clang::lifetimebound]], int *b [[clang::noescape]], int *c, int *d) {
  return nullptr;
}

We run our “map” phase, so our passes infer the following attributes.

int *f(int *a [[clang::noescape]], int *b [[clang::noescape]], int *c [[clang::noescape]], int *d [[clang::noescape]]) {
  return nullptr;
}

This would leave us with the following summary. We couldn’t infer [[clang::lifetimebound]], so we discarded it. It wasn’t correct anyway.

[
  {
    "id": "f",
    "called": [],
    "params": [["noescape"], ["noescape"], ["noescape"], ["noescape"]] // The index of the element in the array corresponds to the index of the parameter.
  }
]

This way we can at least merge the summaries without worrying about them not being correct. We still have the issue, though, that in case of parameter flow, noescape has to be intersected, while lifetimebound should be unioned, so we should allow custom reduction logic for each attribute we support.

This way we can have 1 engine that performs a fixpoint iteration on the summaries and, during each iteration, dispatches these merge functions until nothing changes anymore. The number of traversals over the summaries then doesn’t scale with the number of analysis passes, attributes, etc. we create.

IIUC, there are only 2 kinds of attributes, those with an “every” relationship and those with an “any” relationship.

A few examples for the “every” relationship:

A function is no-write-global only if “every” other function it calls is no-write-global too.
A parameter is noescape if we could infer it and “every” time it is passed to a function as an argument, the argument is also marked noescape.

A few examples for the “any” relationship:

A function is noreturn if “any” other function it calls is noreturn.
A parameter is lifetimebound if we could infer it, or we return the result of “any” other function it is passed to as a lifetimebound argument.

For the “every” type attributes, we want to intersect and for the “any” type attributes, we want to union.
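A minimal sketch of such an engine for function-level attributes could look like the following. The types and names (MergeKind, FunctionSummary, reduce) are assumptions made for this example, and noreturn is handled with the simplified “any” rule from above.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

enum class MergeKind { Every, Any };

// Hypothetical function-level summary, invented for this sketch.
struct FunctionSummary {
  std::set<std::string> Attrs;      // attributes inferred in the "map" phase
  std::vector<std::string> Callees; // functions called by this one
};

using Summaries = std::map<std::string, FunctionSummary>;

static bool hasAttr(const Summaries &S, const std::string &Fn,
                    const std::string &Attr) {
  auto It = S.find(Fn);
  // An unknown callee is treated as not having the attribute.
  return It != S.end() && It->second.Attrs.count(Attr) != 0;
}

// One engine for all attributes: iterate until nothing changes.
void reduce(Summaries &S, const std::map<std::string, MergeKind> &Kinds) {
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (auto &Entry : S) {
      FunctionSummary &Fn = Entry.second;
      for (const auto &Kind : Kinds) {
        const std::string &Attr = Kind.first;
        const bool Local = Fn.Attrs.count(Attr) != 0;
        for (const std::string &Callee : Fn.Callees) {
          const bool CalleeHas = hasAttr(S, Callee, Attr);
          if (Kind.second == MergeKind::Every && Local && !CalleeHas) {
            Fn.Attrs.erase(Attr); // intersect: one callee without it is enough
            Changed = true;
            break;
          }
          if (Kind.second == MergeKind::Any && !Local && CalleeHas) {
            Fn.Attrs.insert(Attr); // union: one callee with it is enough
            Changed = true;
            break;
          }
        }
      }
    }
  }
}
```

“Every” attributes are only ever removed and “any” attributes are only ever added, so the loop terminates. With a chain a() -> b() -> c(), where a() and b() are locally no-write-global and c() is only noreturn, the reduction drops no-write-global from both a() and b() and marks both as noreturn.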

WDYT?

isuckatcs · April 23, 2025, 6:58pm

IIUC, we don’t have to model this relationship inside the framework; the checkers that look for these will just flag the parameter or the function with an attribute. The only thing we have to figure out is whether they are unioned or intersected during the reduction phase. See the examples below.

Seen a parameter that is passed to fclose(), or such? - flag it as closes-handle
The function opens a handle that is not returned, or assigned to any output parameter? - flag the function as opens-unreturned-handle

The parameter is read? - flag it as value-read
The parameter is written? - flag it as value-overwritten

Seen a parameter that is passed to close(), or such? - flag it as closed
Seen an output parameter that is set to an opened handle? - flag it as opens-handle

steakhal · April 24, 2025, 6:50am

Can’t we let the different summaries decide how they want to implement the merge?
It isn’t clear to me that deciding on a merging strategy upfront would bring enough value to justify restricting the design.

I want to also mention that it’s usually not enough in C++ to attach the deduced attributes to parameters. For our use cases we would need to be able to describe access-paths, such as param.first->data was allocated.
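One possible way to represent such an access path is a root parameter plus a list of steps. This is only a sketch; StepKind, AccessStep, AccessPath, PathFact and render are hypothetical names, and the real representation would have to be decided as part of the summary format.

```cpp
#include <string>
#include <vector>

enum class StepKind { Field, Deref };

struct AccessStep {
  StepKind Kind;
  std::string Name; // field name, unused for Deref
};

// A path rooted at a parameter, e.g. param.first->data.
struct AccessPath {
  unsigned ParamIndex;
  std::vector<AccessStep> Steps;
};

// A deduced fact attached to a path instead of to the whole parameter.
struct PathFact {
  AccessPath Path;
  std::string Attr; // e.g. "allocated"
};

// Renders the path as a C++-like expression, mostly for debugging.
std::string render(const AccessPath &P, const std::string &Root) {
  std::string Out = Root;
  bool PendingDeref = false;
  for (const AccessStep &S : P.Steps) {
    if (S.Kind == StepKind::Deref) {
      if (PendingDeref)
        Out = "(*" + Out + ")"; // two dereferences in a row
      PendingDeref = true;
    } else {
      Out += (PendingDeref ? "->" : ".") + S.Name;
      PendingDeref = false;
    }
  }
  if (PendingDeref)
    Out = "(*" + Out + ")"; // path ends in a dereference
  return Out;
}
```

With the steps Field "first", Deref, Field "data" on a parameter named param, render gives back param.first->data.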

isuckatcs · April 24, 2025, 8:35am

This is what I meant initially, though at this point I don’t see what else we would do with the attributes besides the 2 mentioned operations.

Do you have an example use-case where we want to do some other operation that is neither intersection nor union? Do we want to allow some side-effects while reducing the attributes? If yes, what are the benefits of that?

I’m fine with attaching some kind of a metadata to the attributes. We just need to figure out how to store and reduce them. We want to attach the attributes directly to the AST after we reduced them, and since the metadata can be big, we might need a separate storage for it.

So, what should we allow in this metadata? Any arbitrary values, or just a subset of values? In the case of param.first->data, is storing the USR enough, or do we need something else?

Xazax-hun · April 24, 2025, 9:16am

I am not sure if that is a good idea. These annotations might encode important contracts. E.g.:

int *f(int *a [[clang::lifetimebound]], int *b [[clang::noescape]], int *c, int *d) {
#ifdef FEATURE_ON
  return a;
#else
  return nullptr;
#endif
}

So the compiler not seeing the justification for an attribute does not mean there is no justification. It might be under a preprocessor macro. Or it might be an anticipated change in the future and they wanted to add the contracts early to not break callers once the missing feature is added.

This is something we cannot tell for sure, see my explanation above.

I think this one is inevitable if we want a general framework.

This matches what I had in mind.

This is not quite true. It is only the case if the noreturn functions dominate all the exits. Maybe this is another piece of interesting information to persist?

isuckatcs · April 24, 2025, 10:07am

Well, we don’t know what the intents of the user are. We can only check whether what the attribute says holds or not. By discarding the attributes and only keeping the inferred ones, we could actually emit a warning when the attribute doesn’t hold, depending on how accurate our analysis is.

I’m afraid that if we propagate information that we are not responsible for, we can end up seeing a lot of false positives because of 1 or 2 misleading attributes.

Yes, maybe-noreturn would have been a better attribute name.

Domination as it is probably shouldn’t be recorded, but the flow of functions could be recorded as a graph, like a CFG where only the calls are inserted. That can be used to figure out what dominates what later.
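A sketch of what such a call-only graph could look like, together with one way to use it: instead of computing dominators, it checks whether the exit is still reachable from the entry once we stop at every call to a known noreturn function. CallNode, CallFlowGraph and everyPathHitsNoReturn are hypothetical names, and the single-exit graph is an assumption of the sketch.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// A CFG-like graph where only the calls are kept as nodes.
struct CallNode {
  std::string Callee;             // empty for the entry and exit nodes
  std::vector<std::size_t> Succs; // indices of the successor nodes
};

struct CallFlowGraph {
  std::vector<CallNode> Nodes;
  std::size_t Entry = 0;
  std::size_t Exit = 0;
};

// True if the exit cannot be reached from the entry without passing
// through a call to one of the functions in NoReturn.
bool everyPathHitsNoReturn(const CallFlowGraph &G,
                           const std::set<std::string> &NoReturn) {
  std::vector<bool> Seen(G.Nodes.size(), false);
  std::vector<std::size_t> Work;
  Work.push_back(G.Entry);
  while (!Work.empty()) {
    const std::size_t N = Work.back();
    Work.pop_back();
    if (N >= G.Nodes.size() || Seen[N])
      continue;
    Seen[N] = true;
    if (NoReturn.count(G.Nodes[N].Callee) != 0)
      continue; // control never continues past a noreturn call
    if (N == G.Exit)
      return false; // found a path that avoids every noreturn call
    for (std::size_t Succ : G.Nodes[N].Succs)
      Work.push_back(Succ);
  }
  return true;
}
```

If a function branches into either abort() or fail() before reaching its exit, it is only reported as noreturn when both callees are known to be noreturn.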

On the other hand though, the initial proposal was for flow-insensitive checks, and this problem seems to be flow sensitive. Do we want to support flow-sensitive checks too?
