Module Questions

Doug (and others):

I'm working on static analysis of C and C++ code, and need a mechanism for processing groups of TUs that have both:
* a known and statically complete interface (dynamic dispatch & function pointers are a different problem that need not be considered here)
* known and statically enumerable contents.

These properties would enable interesting cross-TU analyses, including some forms of sound inference across TUs.

Reading through the various Modules information I've found (e.g., http://clang.llvm.org/docs/Modules.html), I see that Modules appear to provide about 90% of what I'm looking for. Specifically, there appears to be a map between modules and the headers that comprise their interfaces. By extension, this map also seems to identify some header files that are excluded from the interface of a module.

What I haven’t seen is a map between modules and the TUs that implement them. For my purposes, such a map would indicate (a) for a given TU, which module (or modules) it is part of (if any), or (b) for a given module, which TUs provide its implementation. In principle, one could build the entire mapping between TUs & modules given a complete set of either (a) or (b).

Does such a map exist? Have I just failed to spot it? If not, would something along these lines be an interesting new capability (opt-in only, of course)?
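
For concreteness, here is roughly the shape of the interface map that exists today in a Clang module map, followed by a purely hypothetical sketch of the implementation listing I'm asking about (all names invented for illustration):

    // module.modulemap -- the interface side exists today; names are made up.
    module Widgets {
      header "Widgets.h"                   // part of the module's interface
      exclude header "WidgetsInternal.h"   // explicitly not part of the interface
      export *
    }

    // Hypothetical only -- nothing like the following exists today. Some way of
    // listing the TUs that implement the module, for example:
    //
    //   module Widgets {
    //     ...
    //     implementation "Widgets.cpp"
    //     implementation "WidgetsDraw.cpp"
    //   }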

In addition, would it make sense to allow a TU to claim that it is part of the implementation of a module, perhaps via an attribute in the source code? Such a claim could be useful for static analysis even if it were totally ignored by the rest of the compiler and build system. Indeed, I'd expect such an attribute to be ignored by the rest of the system for the foreseeable future.
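
To sketch the idea (the attribute below is invented purely for illustration; nothing like it exists today, and a compiler that doesn't recognize it would just warn and move on):

    // widgets_impl.cpp -- hypothetical sketch; the attribute name is made up and
    // is not recognized by any compiler. A tool that understands it learns
    // "this TU is part of the implementation of module Widgets"; everything
    // else can ignore it.
    [[analysis::implements_module("Widgets")]];   // hypothetical attribute-declaration

    #include "Widgets.h"

    void widgets_reset_internal_state() {
      // ... implementation detail the analysis may treat as part of Widgets ...
    }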

Dean Sutherland
dsutherland@cert.org

P.S. Doug — we spoke briefly about this at the LLVM Dev conference in November. This is the first of several long-delayed follow-up messages.

I have a few more questions about Modules that I forgot to ask in my last message:

Can a header be part of the specification of more than one module? What I’ve read seems to indicate that a header could be part of both a sub-module and its parent module(s). But what about otherwise unrelated modules?
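
For reference, the parent/child case I have in mind looks something like this in module-map terms (names invented): a header owned by a submodule is, transitively, also part of its enclosing module.

    // module.modulemap sketch: Drawing.h is owned by Widgets.Drawing, and is
    // therefore also reachable as part of its parent module Widgets.
    module Widgets {
      header "Widgets.h"
      module Drawing {
        header "Drawing.h"
      }
      export *
    }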

Similarly, have you seen translation units that would reasonably be considered to be part of the implementation of multiple modules? If so, are those modules parent-child, or completely unrelated?

If you’ve seen either of the above cases, are they common or unusual?

Dean

Doug (and others):

I'm working on static analysis of C and C++ code, and need a mechanism for processing groups of TUs that have both:
* a known and statically complete interface (dynamic dispatch & function pointers are a different problem that need not be considered here)
* known and statically enumerable contents.

These properties would enable interesting cross-TU analyses, including some forms of sound inference across TUs.

Just curious: what part of your cross-TU analysis is blocked on this? We've been doing some forms of cross-TU static analysis without modules, without any problems (apart from parsing time, obviously) :wink:

Manuel (and others):

Just curious: what part of your cross-TU analysis is blocked on this? We've been doing some forms of cross-TU static analysis without modules, without any problems (apart from parsing time, obviously) :wink:

The short version:

We want to do sound static analysis on chunks of programs that are larger than single TUs, but smaller than the entire program. Our particular analysis involves inference, but the issues are pretty much the same for many static analyses.

The long version for those who have both time and interest:

To be sure we’re all on the same page, I’ll start with background that you probably all know. Performing sound inference on source code requires at least the following properties:
1: a known starting point for the properties being inferred.
2: complete (enough) access to the code on which the inference will be performed.
3: proof that the inference algorithm can consider (close enough to) all paths that may-reach the code on which the inference will be performed.

The conceptually easy (but pragmatically difficult) way to meet conditions 2 and 3 is to just analyze the entire program. When feasible, this approach lets you drop the parenthesized qualifiers from those conditions.

Some key real-world challenges for this approach include:
B: Libraries. Few projects have access to all of the source code for all of the libraries they use. What with proprietary operating systems, proprietary libraries, and so on, this is often a real stumbling block.
C: Libraries. Library authors who wish to analyze their own code fundamentally cannot have access to “the entire program.” After all, much of the client code for their library hasn’t been written yet.
D: Program size. Committing to whole-program analysis essentially guarantees that the analysis will be slow, because whole programs (including all required libraries) tend to be very large. Developers generally want to get actionable results in a timely manner.

So static analysis developers who desire real-world impact generally fall back on divide-and-conquer approaches. We can meet the necessary requirements for sound inference across partial programs when we:
1: Know exactly what constitutes the interface between the chunk we’re analyzing and the rest of the program. Most important for inference is the interface this chunk exports to the rest of the program. Of course, knowing the interfaces exported by other program chunks that are visible to the chunk under analysis helps too, by letting us process only the interfaces while skipping the implementations. Modules provide this today (or close enough).
2: Place analysis starting point information at those interface points. For some analyses, this could be information that’s already in the source code; for others, this could be annotations. In any case, the starting point information must be treated as analysis cut points, so we can depend on it from inside the program chunk and enforce it from the outside. Modules (plus source code annotations when needed) provide this today as well (see the sketch just after this list).
3: Know the complete contents of the program chunk we’re analyzing. This provides the completeness property that is required to enable sound analysis of the inside of the program chunk. Modules could provide this, if only there were a way to enumerate the TUs that implement the module. Of course, each individual TU provides this property by default for TU-at-a-time analysis. But past experience has shown that TU-at-a-time analysis requires too much hand-written annotation for most programmers and most projects because the ratio of interface-to-content fails to provide enough leverage at this scale. Larger program chunks have a more favorable (i.e., lower) interface-to-content ratio. [Note: I’m well aware that some development groups choose per-TU analysis anyway, which is fine. If one can support divide-and-conquer at all, it’s reasonably straightforward to support per-TU analysis.]
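
To make point 2 concrete, here is a small sketch (names invented) that uses Clang's existing nullability qualifier as the "starting point information" on an exported interface; the annotation is a cut point that is enforced at call sites outside the chunk and relied upon inside it:

    // counter.h -- exported interface of the chunk under analysis (names invented).
    #ifndef COUNTER_H
    #define COUNTER_H

    class Counter {
    public:
      // Clang nullability qualifier as a cut point: callers must pass a non-null
      // name, so the implementation may use it without re-proving non-nullness.
      void rename(const char *_Nonnull new_name);

      int value() const;

    private:
      int value_ = 0;
    };

    #endif // COUNTER_H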

Note that we can “ignore” dynamic dispatch and function pointers by adding only one more rule: overriders must obey the “analysis starting point information” found on their ancestors (and similarly for function pointers and the function references assigned to them). This works because either (a) the definition, implementation, and invoker (or its ancestor) are all in the same module, or (b) the definition appears in an interface somewhere. Case (a) works because we have all the relevant code in hand. Case (b) works because the analysis cut-point information provides enough knowledge to do the analysis even when the implementation and/or caller are in modules other than the one that exports the interface.
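
A minimal sketch of that overrider rule, again using a nullability qualifier as the stand-in "analysis starting point information" (names invented):

    // renderer.h -- interface exported by some module (names invented).
    struct Renderer {
      // The contract lives on the interface: callers must pass a non-null
      // context; implementations may assume it.
      virtual void draw(const char *_Nonnull ctx) = 0;
      virtual ~Renderer() = default;
    };

    // gl_renderer.cpp -- an implementation TU, possibly in a different module.
    struct GlRenderer final : Renderer {
      // The overrider must honor the contract declared on its ancestor, so the
      // analysis can reason about calls through Renderer* without knowing which
      // override is reached at run time.
      void draw(const char *_Nonnull ctx) override {
        // safe to use ctx here without a null check
      }
    };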

Dean Sutherland
dsutherland@cert.org

New Year’s ping for this thread in general, now that folks are getting back to work.

Dean

As I've said, we have been doing global code analysis for a while now, and I'm not sure how modules would help (apart from speed-ups). If you want to restrict yourself to certain depths/components, I'd suggest doing this at a higher level. I don't think I have more insights, sorry…

Cheers,
/Manuel