Motivation
After the round tables at EuroLLVM 2025, we had a chat with @Xazax-hun about implementing a prototype for the summary-based approach and seeing how it affects the quality of the bug reports, starting with the Clang Static Analyzer.
Implementation Details
I imagine an implementation similar to what we have with the compilation database. Instead of the build system emitting a compile_commands.json file, though, we add a new flag to the compiler, e.g. -emit-summary, and let the compiler create a <source-file>-summary.json file.
The various clang tools could then consume this similarly to how compile_commands.json is consumed. Except, instead of accepting only one summary, the tools could take multiple ones, which get merged during processing.
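As a rough illustration of the merge step, here is a minimal C++ sketch that combines several already-parsed per-file summaries into one table keyed by function id. The FunctionSummary struct, the mergeSummaries name, and the policy for duplicate ids (keep only the attributes both entries agree on, and the union of the callees, e.g. for an inline function summarized in several translation units) are assumptions made for this example, not part of the proposal.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one entry of a <source-file>-summary.json.
struct FunctionSummary {
  std::string id;               // unique identifier of the function
  std::set<std::string> attrs;  // attributes inferred from the body alone
  std::set<std::string> deps;   // ids of the functions it calls
};

using SummaryTable = std::map<std::string, FunctionSummary>;

// Merge the entries of several per-file summaries into one table.
SummaryTable mergeSummaries(
    const std::vector<std::vector<FunctionSummary>> &Files) {
  SummaryTable Merged;
  for (const auto &File : Files) {
    for (const FunctionSummary &S : File) {
      auto Inserted = Merged.emplace(S.id, S);
      if (Inserted.second)
        continue; // first time this id is seen
      // Assumed policy for duplicates: stay conservative by keeping only
      // the attributes both entries agree on, and every callee either lists.
      FunctionSummary &Existing = Inserted.first->second;
      std::set<std::string> Common;
      for (const std::string &A : Existing.attrs)
        if (S.attrs.count(A))
          Common.insert(A);
      Existing.attrs = std::move(Common);
      Existing.deps.insert(S.deps.begin(), S.deps.end());
    }
  }
  return Merged;
}
```

Keying the table by id means the later reduce step can look up a callee in one place, regardless of which summary file it came from.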
The Map Phase
During the round tables, a MapReduce-like infrastructure was proposed. At first glance, I would imagine the Map part happening inside Sema as each FunctionDecl is processed.
We would have a class, e.g. Summarizer, that runs a series of analysis passes on the function after it has been resolved by Sema and its body is known. The result is immediately appended to the <source-file>-summary.json file. This way, by the time clang has finished processing the AST, we will also have the summary if it was requested.
The Summary
JSON seems to be a convenient format for the summaries, as we can already work with it in the LLVM project. IIUC, we need to store at least 3 things about a function: its unique identifier, the set of attributes we could infer, and the other functions it calls.
Take the following snippet as an example, and assume we want to figure out if the function writes to global variables or not.
void a() {
  int x = 0;
  b();
}
The function a() doesn’t write to global variables if it doesn’t contain a direct write to a global variable and b() doesn’t write to global variables either. However, b() might be defined in a separate translation unit, so at this point we only perform intra-procedural analysis, i.e. we check for direct writes only. As a result, we end up with the following summary.
[
  {
    "id": "a",
    "attrs": ["NO_WRITE_GLOBAL"],
    "deps": ["b"]
  }
]
The b() function might look like this:
void b() {
  global = 1;
}
In this case, we end up with this summary:
[
  {
    "id": "b",
    "attrs": [],
    "deps": []
  }
]
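To make the shape of one entry concrete, the following is a small, self-contained C++ sketch that serializes a summary entry to JSON text. The FunctionSummary struct and the quote, toJsonArray and toJson helpers are hypothetical names for this example, and the field names follow the example above, with the key spelled deps. In-tree, the existing llvm::json support would presumably be the more natural choice; plain standard C++ is used here only to keep the sketch standalone.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of one summary entry.
struct FunctionSummary {
  std::string id;                  // unique identifier of the function
  std::vector<std::string> attrs;  // inferred attributes
  std::vector<std::string> deps;   // ids of the called functions
};

// Quote a string for JSON, escaping only '"' and '\\'. This is enough for
// the sketch; a full writer would also handle control characters.
std::string quote(const std::string &S) {
  std::string Out = "\"";
  for (char C : S) {
    if (C == '"' || C == '\\')
      Out += '\\';
    Out += C;
  }
  return Out + "\"";
}

// Render a list of strings as a JSON array.
std::string toJsonArray(const std::vector<std::string> &Items) {
  std::string Out = "[";
  for (std::size_t I = 0; I < Items.size(); ++I) {
    if (I != 0)
      Out += ", ";
    Out += quote(Items[I]);
  }
  return Out + "]";
}

// Render one summary entry as a JSON object on a single line.
std::string toJson(const FunctionSummary &S) {
  return "{\"id\": " + quote(S.id) + ", \"attrs\": " + toJsonArray(S.attrs) +
         ", \"deps\": " + toJsonArray(S.deps) + "}";
}
```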
As for identifying the functions, using their names is not an ideal option, as there can be multiple functions with the same name (overloads, static functions local to source files, etc.), so we have a couple of options here. USR, MangledName, ODRHash could all be viable options, though USR seems to be the most convenient, as it has a way to deal with static functions in different translation units with the same signature.
The Reduce Phase
Reducing the summaries can either happen in a different class, or we could reuse the Summarizer, so that it can both summarize a function and add attributes to it based on existing summaries.
In either case, we would parse all the summaries given to the various clang tools, and for each function, to figure out its actual attributes, we could run the following algorithm:
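attrs calculateAttrs(FunctionSummary F) {
  // start from the attributes inferred from the body of F alone
  attributes = F.attrs;
  // keep only what also holds for every function F calls
  for (D in F.deps)
    attributes = intersect(attributes, calculateAttrs(getSummary(D)));
  return attributes;
}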
In other words, we probably only want to keep those attributes that also apply to every function the visited function calls. We would also use some form of caching, so that the same function is not evaluated multiple times. In case of cycles, we can probably just return the function’s own attributes before entering the for loop.
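One possible way to make this concrete is sketched below in C++. Note that it is not the recursive formulation from the pseudocode: it is an iterative fixed-point variant, chosen here because it handles cycles without a special case and without caching partial results computed in the middle of a cycle. The FunctionSummary struct, the table types and the reduceSummaries name are hypothetical, and treating a callee with no summary as having no attributes is an assumption of this sketch.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one summary entry, keyed by id in the table.
struct FunctionSummary {
  std::set<std::string> attrs;    // attributes inferred from the body alone
  std::vector<std::string> deps;  // ids of the directly called functions
};

using SummaryTable = std::map<std::string, FunctionSummary>;
using AttrTable = std::map<std::string, std::set<std::string>>;

// Keep only the attributes present in both sets.
std::set<std::string> intersect(const std::set<std::string> &A,
                                const std::set<std::string> &B) {
  std::set<std::string> Out;
  for (const std::string &X : A)
    if (B.count(X))
      Out.insert(X);
  return Out;
}

// Start from the per-function attributes and repeatedly remove the ones
// that some callee lacks, until nothing changes anymore.
AttrTable reduceSummaries(const SummaryTable &Summaries) {
  AttrTable Result;
  for (const auto &Entry : Summaries)
    Result[Entry.first] = Entry.second.attrs;

  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &Entry : Summaries) {
      std::set<std::string> New = Result[Entry.first];
      for (const std::string &Dep : Entry.second.deps) {
        auto It = Result.find(Dep);
        if (It == Result.end())
          New.clear(); // assumption: unknown callee, drop every attribute
        else
          New = intersect(New, It->second);
      }
      if (New != Result[Entry.first]) {
        Result[Entry.first] = std::move(New);
        Changed = true;
      }
    }
  }
  return Result;
}
```

Attribute sets only ever shrink, so the loop terminates. Two mutually recursive functions that both carry an attribute in their own summaries keep it, while any function that can reach a callee lacking the attribute loses it.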
Depending on how expensive this is, the algorithm could also be run while Sema processes the source file and sees a function call, so that the emitted warning can also use the summary built so far. This might be difficult though, as in C++ a function might be defined only after it is called, and in those cases we don’t know anything about its body as we haven’t seen it yet.
void foo();
void bar() {
  foo(); // when we see this, we couldn't have analyzed the body of 'foo()'
}
void foo() {
}
As for using these attributes, I see two approaches. The first is to add them directly to a FunctionDecl in the AST, so they are available automatically everywhere, but this might require modifying the AST. The other approach is to store them in a separate object, so the AST is not modified, but then every call site that wants to read these attributes needs to be modified.
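The second approach could look roughly like the following C++ sketch of a side table. The SummaryAttrs class, its set and has methods, and keying by a string id (such as a USR) rather than by a declaration pointer are all assumptions made for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical side table holding the reduced attributes, kept outside the AST.
class SummaryAttrs {
public:
  // Record the final attribute set of a function.
  void set(const std::string &FunctionId, std::set<std::string> Attrs) {
    Table[FunctionId] = std::move(Attrs);
  }

  // Returns false for functions without an entry. Assumption: attributes are
  // positive guarantees, so "not known" is the conservative answer.
  bool has(const std::string &FunctionId, const std::string &Attr) const {
    auto It = Table.find(FunctionId);
    return It != Table.end() && It->second.count(Attr) != 0;
  }

private:
  std::map<std::string, std::set<std::string>> Table;
};
```

A call site that wants to use the information would then ask something like Attrs.has(Id, "NO_WRITE_GLOBAL") instead of inspecting the declaration, which is exactly the per-call-site change mentioned above.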
Implementation Plan
With @Xazax-hun we discussed starting small and implementing only the reasoning about whether a function writes to globals, because it is a relatively simple analysis, and the CSA could use this information directly for conservative evaluation.
This also requires implementing the generation and the consumption of the summaries. Once we have a working prototype and see what blockers we have, how expensive this approach is, etc., we can think about additional steps, such as build system integration.
I wonder if you have any thoughts on this, see some cases we’re missing here, or know better, more efficient approaches. Any feedback is appreciated.