[analyzer][tooling] Analyzer architecture


In order not to overburden the previous discussion about Analyzer and Tooling, I would like to ask for your opinions on a related but slightly orthogonal matter.

Gabor and I had a brainstorming session about the issues CTU analysis and compilation command handling (previous topic) brought up recently.

Note that these points are to be regarded as cursory expeditions into the hypothetical (at best).

The train of thought regarding CTU analysis had the following outline:

  • We need a tool that gets a FunctionDecl (the function which we would like to inline) and returns the AST of its TU.
      • The fitting abstraction level of the result seems to be the TU level.
      • externalDefMapping.txt is just an implementation detail; actually, we don’t need that.
  • Let’s call this tool **ASTServer**.
      • ASTServer has some resemblance to clangd.
      • Works on the whole project
      • Uses the compilation DB
      • Persists already parsed ASTs in its memory (up to a limit)
      • (Cache eviction strategies? LRU?)
  • The AST would be returned on a socket and in a serialized form (ASTReader/Writer).
      • This could also work over the network, promoting distribution.
  • We need another tool: **clang-analyzer** !!!
      • Actually, we should have done this earlier.
      • Utilizes clang for analysis purposes
      • Handles communication with ASTServer
      • Caches ASTs from the server
  • An external orchestrator tool (CodeChecker) would launch ASTServer and then call the clang-analyzer tool for each TU, thus conducting the analysis.
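
To make the "persists already parsed ASTs up to a limit" point concrete, here is a minimal sketch of an LRU cache keyed by TU path. Everything in it is an assumption made for illustration: the `ASTCache` class, the byte-based budget, and the `fake_parse` stand-in for a real parser are hypothetical, and the serialized AST is modelled as an opaque `bytes` value.

```python
from collections import OrderedDict
from typing import Callable, List


class ASTCache:
    """LRU cache mapping a TU path to its serialized AST (hypothetical sketch)."""

    def __init__(self, parse: Callable[[str], bytes], max_bytes: int) -> None:
        self._parse = parse          # stand-in for "parse this TU and serialize it"
        self._max_bytes = max_bytes  # memory budget for cached ASTs
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0

    def get(self, tu_path: str) -> bytes:
        if tu_path in self._entries:
            self._entries.move_to_end(tu_path)  # mark as most recently used
            return self._entries[tu_path]
        ast = self._parse(tu_path)
        self._entries[tu_path] = ast
        self._used += len(ast)
        # Evict least recently used entries until the budget is met again,
        # but always keep the entry that was just produced.
        while self._used > self._max_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self._used -= len(evicted)
        return ast

    def cached_paths(self) -> List[str]:
        """Cached TU paths, least recently used first."""
        return list(self._entries)


def fake_parse(tu_path: str) -> bytes:
    """Pretend parser: always yields 10 bytes so the budget math is easy to follow."""
    return tu_path.encode().ljust(10, b".")[:10]


cache = ASTCache(fake_parse, max_bytes=25)
for path in ["a.cpp", "b.cpp", "a.cpp", "c.cpp"]:
    cache.get(path)
# "b.cpp" was the least recently used entry when "c.cpp" exceeded the budget.
print(cache.cached_paths())
```

Whether LRU is the right policy is exactly the open question in the list above; the sketch only shows that the policy can be isolated behind a small interface.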

The reasoning behind the separation:

The analyzer is a complex subsystem of Clang. The valid concern of the clang binary growing out of proportion, and the increasing need for tooling dependencies surfacing due to CTU analysis, indicate the need to reorganize these facilities.

The point is further backed by the argument that complex interprocess-communication functionality (over sockets, in our example) is even less desirable inside the clang binary than binary-size bloat.

Also, the complexity of the whole solution could be distributed, and the concerns of build-system management and build-configuration formats could be separated from the analyzer itself, while still allowing a wide variety of build-system and analysis cooperation schemes to be implemented.
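
As a rough illustration of keeping build-configuration handling outside the analyzer, the sketch below turns a JSON compilation database into per-TU jobs that an orchestrator could hand out. The `Job` type and `load_jobs` function are hypothetical names; only the entry fields (`directory`, `file`, and either `arguments` or `command`) come from the compilation database format itself.

```python
import json
import shlex
from typing import List, NamedTuple


class Job(NamedTuple):
    """One unit of work for the orchestrator (hypothetical type)."""
    directory: str    # working directory of the original compilation
    file: str         # main source file of the TU
    args: List[str]   # compiler argument vector, including the executable


def load_jobs(compile_db_text: str) -> List[Job]:
    """Parse compilation database JSON text into one Job per entry."""
    jobs = []
    for entry in json.loads(compile_db_text):
        # An entry carries either an "arguments" list or a "command" string.
        if "arguments" in entry:
            args = list(entry["arguments"])
        else:
            args = shlex.split(entry["command"])
        jobs.append(Job(entry["directory"], entry["file"], args))
    return jobs


EXAMPLE_DB = """[
  {"directory": "/build", "file": "a.cpp", "command": "clang++ -Iinc -c a.cpp"},
  {"directory": "/build", "file": "b.cpp",
   "arguments": ["clang++", "-DNDEBUG", "-c", "b.cpp"]}
]"""

jobs = load_jobs(EXAMPLE_DB)
for job in jobs:
    print(job.file, job.args)
```

With this split, the analyzer side would only ever see a source file and an argument vector, and a different build-system integration could produce the same `Job` values from another source.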

Again, the scope of these ideas is not trivial to assess, and implementing them would probably require a considerable amount of effort, but I hope an open discussion will outline a solution that benefits the structure of the whole project.



Having a server that provides syntax trees (and other such elements) has been investigated and developed by zapcc (https://github.com/yrnkrn/zapcc) in the past. IIRC, they start N threads in the background of the build system, and the compiler that actually gets called produces its results using this background server. I am not sure whether “sockets” are used, however…

Maybe it was discussed earlier, but isn’t there a way to split SA from the clang binary somehow? I know we have “CLANG_ENABLE_STATIC_ANALYZER=OFF”.

Endre Fülöp via cfe-dev <cfe-dev@lists.llvm.org> wrote (on Wed, Apr 29, 2020, 10:15):

It’s worth carefully thinking about your design goals for this system. Particularly how much you value:

  • predictability (isolation and debugging)

  • efficiency (e.g. in terms of total CPU usage)

  • scalability (often in tension with efficiency)

We’ve had some good experience with a mapreduce approach for cross-TU analysis, for dead-code analysis etc.
The idea is your analysis is composed of pure functions that run on a single TU.

e.g. for inline-function, this would be:

  1. [Prepare] analyze the TU containing the target function; this is a function (input spec, TU AST) → function AST
  2. [Map] analyze every TU to find occurrences and compute edits; this is a function (TU AST, function AST) → [(file, edit)]
  3. [Reduce] group by file and reconcile edits; this is a function (file, [edit]) → edit

It trades off a bit of efficiency to be highly predictable (pure functions are easy to test, intermediate states can be saved for analysis, bugs are easily localizable to TUs) and scalable.
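
The three steps can be sketched as pure functions over a toy model. This is only an illustration under stated assumptions: the `TU` and `Edit` types are hypothetical stand-ins for real ASTs and source edits, function bodies are plain strings, and the reconciled result of Reduce is modelled as an ordered, deduplicated edit list rather than a single merged edit.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class TU(NamedTuple):
    """Toy stand-in for a TU AST (hypothetical)."""
    name: str
    definitions: Dict[str, str]          # function name -> body text
    calls: List[Tuple[str, int, str]]    # (file, line, callee); file may be a header


class Edit(NamedTuple):
    line: int
    replacement: str


def prepare(target: str, tu: TU) -> str:
    """(input spec, TU AST) -> function AST."""
    return tu.definitions[target]


def map_tu(tu: TU, target: str, body: str) -> List[Tuple[str, Edit]]:
    """(TU AST, function AST) -> [(file, edit)]."""
    return [(file, Edit(line, body))
            for file, line, callee in tu.calls if callee == target]


def reduce_file(file: str, edits: List[Edit]) -> List[Edit]:
    """(file, [edit]) -> reconciled edits: duplicates dropped, bottom-up order."""
    return sorted(set(edits), key=lambda e: e.line, reverse=True)


def run(target: str, defining_tu: TU, all_tus: List[TU]) -> Dict[str, List[Edit]]:
    body = prepare(target, defining_tu)
    grouped: Dict[str, List[Edit]] = defaultdict(list)
    for tu in all_tus:  # each map_tu call is independent, so this loop could be distributed
        for file, edit in map_tu(tu, target, body):
            grouped[file].append(edit)
    return {file: reduce_file(file, edits) for file, edits in grouped.items()}


lib = TU("lib.cpp", {"twice": "x + x"}, [])
a = TU("a.cpp", {}, [("a.cpp", 10, "twice"), ("util.h", 3, "twice")])
b = TU("b.cpp", {}, [("util.h", 3, "twice"), ("b.cpp", 7, "other")])
result = run("twice", lib, [lib, a, b])
print(result)
```

In this toy input, both `a.cpp` and `b.cpp` see the same call in `util.h`, so the Map step reports that edit twice and the Reduce step is where the duplicates get reconciled.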

It does require your intermediate data to be serializable, but distributing over a network server does too. Having the “framework part” not be too opinionated about the form of this data gives some useful flexibility.

Compared to this, your ASTServer seems to sacrifice scalability and predictability for efficiency, if I’m understanding it correctly. It’s worth carefully considering whether this is the right tradeoff (e.g. it only makes sense if your analyses are often slow enough to be worth squeezing this efficiency out of, but fast enough that they don’t need to be seriously distributed).

The Tooling libraries have fair support for Map steps, but none for Reduce and nothing very useful for stringing steps together. It’s possible to bolt this stuff on but I regret that we haven’t added it.


Thanks for the insight. It could definitely open new horizons if we could distribute AST manipulation in a map-reduce fashion.

I would just like to point out that, at the IR level, the workflow you introduced is definitely worth considering, because the edits in that case are just inlined IR instructions at call sites. While some difficulties may arise there (I am just making assumptions, as I am no expert in IR or CodeGen), the (almost) context-insensitivity helps a lot. In the case of ASTs, however, one function node can contain references to other nodes all around the TU. Inlining a function is not trivial; the ASTImporter library does exactly this, and it works by importing the function itself and all of its dependent declarations and types. This led to the consideration that TUDecls should be the atomic units of CTU.
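
To illustrate why importing one function can pull in much of a TU, here is a toy model of "the function itself plus all of its dependent declarations and types" as a transitive closure over a dependency graph. This is not how ASTImporter is implemented; the `import_closure` function and the string-named declarations are assumptions made purely for illustration.

```python
from typing import Dict, List, Set


def import_closure(root: str, deps: Dict[str, List[str]]) -> List[str]:
    """Return root plus every declaration reachable from it, dependencies first."""
    order: List[str] = []
    seen: Set[str] = set()

    def visit(decl: str) -> None:
        if decl in seen:
            return  # already handled; this also stops on reference cycles
        seen.add(decl)
        for dep in deps.get(decl, []):
            visit(dep)
        order.append(decl)  # emitted only after everything it depends on

    visit(root)
    return order


# Hypothetical declarations: f calls g, and both use struct S, which uses typedef T.
deps = {
    "f": ["struct S", "g"],
    "g": ["struct S", "typedef T"],
    "struct S": ["typedef T"],
    "typedef T": [],
}
closure = import_closure("f", deps)
print(closure)
```

Even in this four-node example, asking for `f` yields every declaration in the graph, which is the intuition behind treating the whole TU as the unit of transfer.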

I am still not fully aware of the implications of these arguments on the architecture you mentioned, but I will be investigating, and any discussion is welcome on my part.