Manuel (and others):
Just curious: what part of your cross-TU analysis is blocked on this? We’ve been doing some forms of cross-TU static analysis without modules without any problems (apart from parsing time, obviously
The short version:
We want to do sound static analysis on chunks of programs that are larger than single TUs, but smaller than the entire program. Our particular analysis involves inference, but the issues are pretty much the same for many static analyses.
The long version for those who have both time and interest:
To be sure we’re all on the same page, I’ll start with background that you probably all know. Performing sound inference on source code requires at least the following properties:
1: a known starting point for the properties being inferred.
2: complete (enough) access to the code on which the inference will be performed.
3: proof that the inference algorithm can consider (close enough to) all paths that may-reach the code on which the inference will be performed.
The conceptually easy—but pragmatically difficult—way to meet conditions 2 and 3 is to just analyze the entire program. When feasible, this approach lets you elide the parenthesized parts of those statements.
Some key real-world challenges for this approach include:
B: Libraries. Few projects have access to all of the source code for all of the libraries they use. What with proprietary operating systems, proprietary libraries, and so on, this is often a real stumbling block.
C: Libraries. Library authors who wish to analyze their own code fundamentally cannot have access to “the entire program.” After all, much of the client code for their library hasn’t been written yet.
D: Program size. Committing to whole-program analysis essentially guarantees that the analysis will be slow, because whole programs (including all required libraries) tend to be very large. Developers generally want to get actionable results in a timely manner.
So static analysis developers who desire real-world impact generally fall back on divide-and-conquer approaches. We can meet the necessary requirements for sound inference across partial programs when we:
1: Know exactly what constitutes the interface between the chunk we’re analyzing and the rest of the program. Most important for inference is the interface this chunk exports to the rest of the program. Of course, knowing the interfaces exported by other program chunks that are visible to the chunk under analysis helps too, by letting us process only the interfaces while skipping the implementations. Modules provide this today (or close enough).
2: Place analysis starting point information at those interface points. For some analyses, this could be information that’s already in the source code; for others, this could be annotations. In any case, the starting point information must be treated as analysis cut points, so we can depend on it from inside the program chunk and enforce it from the outside. Modules (plus source code annotations when needed) provide this today as well.
3: Knowledge of the complete contents of the program chunk we’re analyzing. This provides the completeness property that is required to enable sound analysis of the inside of the program chunk. Modules could provide this, if only there were a way to enumerate the TUs that implement the module. Of course each individual TU provides this property by default for TU-at-a-time analysis. But past experience has shown that TU-at-a-time analysis requires too much hand-written annotation for most programmers and most projects because the ratio of interface-to-content fails to provide enough leverage at this scale. Larger program chunks have a more favorable (i.e., lower) interface-to-content ratio. [Note: I’m well aware that some development groups choose per-TU analysis anyway, which is fine. If one can support divide-and-conquer at all, it’s reasonably straight-forward to support per-TU analysis.]
Note that we can “ignore” dynamic dispatch and function pointers by adding only one more rule: overriders must obey the "analysis starting point information” found on their ancestors (and similarly for function pointers and the function references assigned to them). This works because the definition, implementation and invoker (or its ancestor) must either be (a) all in the same module, or (b) the definition must appear in an interface somewhere. Case (a) works because we have all the relevant code in hand. Case (b) works because the analysis cut-point information provides enough knowledge to do the analysis even when the implementation and/or caller are in modules other than the one that exports the interface.