RFC: SSAF Entity Linker
Aviral Goel, Static Security Tools, Apple
Summary
This RFC presents the design of the Entity Linker, a component of the Scalable Static Analysis Framework (SSAF). The Entity Linker combines multiple translation-unit (TU) summaries into a unified link-unit (LU) summary by deduplicating program entities based on C and C++ linkage rules and merging their summary data in an analysis-agnostic manner. Inspired by binary linkers, it operates at a higher semantic level to support whole-program static analysis. The implementation supports linking arbitrary analysis data encoded in extensible serialization formats and handles large-scale codebases by linking one translation unit at a time.
This RFC is organized as follows: Motivation explains why the Entity Linker is necessary with an Illustrative Example. Design Principles discusses core implementation decisions. Implementation describes the Linking Algorithm and Code Organization. Limitations discusses constraints and trade-offs. Alternatives Considered evaluates several rejected approaches. Future Implementation Work outlines possible future enhancements.
Motivation
The SSAF framework enables whole-program static analysis and source code transformation by introducing a three-phase workflow: summary extraction during compilation, summary linking across translation units, and summary analysis on the merged data during client analysis. The Entity Linker implements the middle phase, serving as the bridge between per-TU and whole-program views. It presents a unified view of the program to client analyses by resolving program entities across all TUs based on language linkage rules.
Illustrative Example
Consider a simple two-file program app where math.cpp defines an add function and a static helper function, and main.cpp declares add, defines main, and also has its own static helper function.
math.cpp
// Static helper function (internal linkage, only visible in this file)
static int helper(int x) {
  return x * 2;
}
// Public add function
int add(int a, int b) {
  return helper(a) + b;
}
main.cpp
// Forward declaration of add function from math.cpp
int add(int, int);
// Static helper function (internal linkage, only visible in this file)
static int helper(int result) {
  return result + 5;
}
int main() {
  int sum = add(5, 3);
  return helper(sum);
}
After separate compilation, we have two TU summaries, each with its own entities referenced using local identifiers. The math.cpp summary contains entities for add (external linkage) and helper (internal linkage), both in the math.cpp translation unit's namespace. The main.cpp summary contains entities for add (external), main (external), and helper (internal) in the main.cpp translation unit's namespace. Both summaries also include analysis data that references these entities by their local identifiers. Before performing whole-program analysis on this data, we must solve several interrelated problems:
- Both translation units reference the add function as TU-local entities, but only one canonical entity should exist for the whole program.
- Both files have a helper function with internal linkage; these must remain distinct despite having the same name.
- Both summaries use overlapping local identifiers for their entities, requiring a resolution pass to assign globally unique identifiers.
- All entity identifier references embedded in the analysis data must be updated to use the new global identifiers.
The Entity Linker solves all four of these problems.
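For concreteness, the two TU summaries might serialize along the following lines. The schema is purely illustrative (the field names and call-edge payload are invented for this example); only the @-prefixed keys, which mark entity identifier references in the JSON format this RFC proposes, follow an actual convention. Note that both summaries reuse the local identifiers 1 and 2, and both contain an external add entity:

```json
[
  {
    "unit": "math.cpp",
    "entities": {
      "1": { "name": "add",    "linkage": "external" },
      "2": { "name": "helper", "linkage": "internal" }
    },
    "data": { "calls": [ { "@caller": 1, "@callee": 2 } ] }
  },
  {
    "unit": "main.cpp",
    "entities": {
      "1": { "name": "main",   "linkage": "external" },
      "2": { "name": "add",    "linkage": "external" },
      "3": { "name": "helper", "linkage": "internal" }
    },
    "data": { "calls": [ { "@caller": 1, "@callee": 2 },
                         { "@caller": 1, "@callee": 3 } ] }
  }
]
```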
Design Principles
Three core principles guide the architectural decisions and implementation choices of our Entity Linker.
- Binary Linker Alignment - The Entity Linker follows the semantics of binary linkers where possible while operating at a higher abstraction level. This alignment grounds static analysis behavior in the concrete execution properties of programs, allowing developers to leverage their existing mental models of compilation and linking.
- Unified Program View - The Entity Linker presents a unified whole-program view to client analyses by handling cross-translation-unit entity resolution and data merging. This ensures that analyses focus solely on program semantics rather than reconciling fragmented per-translation-unit views.
- Analysis-Agnostic Operation - The Entity Linker operates without knowledge of summary data schemas. This enables the linker to function as a standalone tool that automatically handles summaries from any analysis without recompilation.
Implementation
The Entity Linker is a standalone tool invoked by the build system at linking edges of the build graph, where binary linkers are invoked. The linker takes as input serialized TU summary files (one per object file) and produces a single serialized LU summary that provides a whole-program view to client analyses.
Linking Algorithm
The linking algorithm operates in three phases: resolution transforms entity names and assigns globally unique identifiers, merging combines analysis data from multiple translation units, and patching updates entity references in the analysis summary data.
Phase 1: Resolve
The resolution phase combines individual TU entity tables into a unified LU entity table by assigning globally unique identifiers to every TU-local entity. This transformation has two goals: deduplicate external linkage entities with the same name, and disambiguate internal linkage entities with the same name. To achieve this, the linker transforms entity namespaces based on their linkage. External linkage entities are assigned the namespace of the target link unit, making them visible across the entire linked program. Entities with the same name and external linkage thus receive the same namespace, enabling deduplication. Internal linkage entities have their namespaces qualified under the target link unit namespace, creating a hierarchical namespace that distinguishes local entities with the same names from different translation units. The linker inserts these transformed entities into the output entity table to assign fresh identifiers. If an external linkage entity with the same name already exists in the table, the linker reuses its identifier. Internal linkage entities always receive fresh identifiers because their qualified namespaces ensure no conflicts with entities from other translation units.
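A minimal sketch of this resolution rule, using illustrative types rather than the actual SSAF data structures:

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative entity model; the real SSAF representation differs.
enum class Linkage { External, Internal };

struct Entity {
  std::string name;        // e.g. "helper"
  std::string tuNamespace; // namespace of the defining TU, e.g. "math.cpp"
  Linkage linkage;
};

// External entities move into the link unit's namespace, so equal names
// collide (and deduplicate). Internal entities keep their TU namespace,
// qualified under the link unit's namespace, so equal names stay distinct.
std::string resolvedName(const Entity &e, const std::string &luNamespace) {
  if (e.linkage == Linkage::External)
    return luNamespace + "::" + e.name;
  return luNamespace + "::" + e.tuNamespace + "::" + e.name;
}

class EntityTable {
  std::map<std::string, unsigned> ids;
  unsigned next = 0;

public:
  // Insert a transformed entity, reusing the identifier when an entity
  // with the same resolved name already exists; otherwise assign a
  // fresh one.
  unsigned insert(const Entity &e, const std::string &lu) {
    auto [it, fresh] = ids.emplace(resolvedName(e, lu), next);
    if (fresh)
      ++next;
    return it->second;
  }
};
```

In this sketch, add from both TUs resolves to app::add and shares one identifier, while the two helpers resolve to app::math.cpp::helper and app::main.cpp::helper and receive distinct identifiers.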
Phase 2: Merge
The merging phase transfers summary data from TU-local entities to corresponding globally unique entities. For each entity processed in the resolution phase, the linker retrieves the summary data associated with its TU-local identifier and assigns it to its new identifier in the output LU data structure. For external linkage entities, such as class definitions included in many TUs, multiple TU-local identifiers map to the same global identifier. In this case, the summary data associated with the first encountered TU-local identifier takes precedence; subsequent data is dropped. Unless there is an ODR violation, these summary data instances will be semantically equivalent, so it does not matter which one is picked by the linker. The Entity Linker does not detect or report ODR violations because checking for semantic equivalence would require knowledge of the analysis data schema.
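The first-wins rule can be sketched as follows, where Payload stands in for whatever opaque summary data an analysis produced (types are illustrative, not the SSAF API):

```cpp
#include <cassert>
#include <map>
#include <string>

// Opaque analysis payload; the linker never inspects its contents.
using Payload = std::string;

// Re-key one TU's summary data from TU-local to global identifiers.
// std::map::emplace is a no-op when the key is already present, so when
// several TU-local ids map to one global id (external linkage), the
// first payload encountered wins and later ones are dropped.
void mergeInto(std::map<unsigned, Payload> &luData,
               const std::map<unsigned, Payload> &tuData,
               const std::map<unsigned, unsigned> &localToGlobal) {
  for (const auto &[localId, payload] : tuData)
    luData.emplace(localToGlobal.at(localId), payload);
}
```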
Phase 3: Patch
The patching phase updates all entity identifier references embedded within the summary data transferred to the LU data structure. Since the linker has no knowledge of the analysis data schema, it delegates this responsibility to the serialization format implementation by providing an entity reassignment table mapping old identifiers to new identifiers. Each serialization format must implement patching logic appropriate to its data representation. Notably, serialization format implementations are also not expected to understand the analysis data schema; patching is purely syntactic replacement within the data stream or its intermediate representation. Typical implementations include either prefix bytes to mark entity identifiers or a table identifying their locations in the data stream. As part of this proposal, we implement patching support for the JSON format. Values corresponding to keys beginning with @ are interpreted as entity identifiers. The JSON patcher recursively traverses the JSON data structure to identify these keys and perform replacements.
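A sketch of such a JSON patcher, written against a minimal JSON-like value type rather than a real JSON library (the actual implementation operates on the serialization library's own representation):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Minimal recursive JSON-like value, for illustration only.
struct Json;
using JsonArray = std::vector<Json>;
using JsonObject = std::map<std::string, Json>;
struct Json {
  std::variant<long, std::string, JsonArray, JsonObject> value;
};

// Recursively rewrite entity identifier references: the value of any
// key beginning with '@' is looked up in the reassignment table; all
// other values are traversed. The patcher never interprets the schema.
void patch(Json &json, const std::map<long, long> &reassign) {
  if (auto *obj = std::get_if<JsonObject>(&json.value)) {
    for (auto &[key, val] : *obj) {
      if (!key.empty() && key.front() == '@')
        val.value = reassign.at(std::get<long>(val.value));
      else
        patch(val, reassign);
    }
  } else if (auto *arr = std::get_if<JsonArray>(&json.value)) {
    for (auto &elem : *arr)
      patch(elem, reassign);
  }
}
```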
Code Organization
The Entity Linker implementation will reside in the LLVM project structure under the SSAF framework.
- The implementation and its associated data structures will be in clang/lib/Analysis/Scalable/EntityLinker/, with corresponding headers in clang/include/clang/Analysis/Scalable/EntityLinker/.
- The command-line tool for build system integration will live in clang/tools/ssaf-linker/.
- Documentation will be shared in clang/docs/ScalableStaticAnalysisFramework.
Limitations
The current design of the Entity Linker prioritizes correctness and clarity over performance, adding overhead to the linking phase proportional to the number of translation units being linked and the size of their analysis data. We could address this in the future through optimizations such as parallel linking of TU summaries and efficient patching using pre-computed fixup tables in a binary serialization format.
The Entity Linker does not retain multiple summary data instances for external linkage entities with the same name, aligning with the C++ One Definition Rule. However, this may limit analyses that want to study ODR violations to identify build configuration issues, header file problems, or portability bugs.
The Entity Linker delegates the complexity of patching entity identifier references to serialization format implementations. A buggy patching implementation could silently corrupt analysis data by failing to update entity references or updating them incorrectly. The linker framework cannot validate patching correctness since it has no knowledge of analysis data schemas. Format implementations must include thorough testing to ensure all entity references are properly identified and updated.
Alternatives Considered
We considered several alternative designs but rejected them due to their shortcomings.
1. Using Pointers to Represent Entity Identifiers
Instead of using opaque integer indices as entity identifiers, we could use memory addresses of allocated entity metadata to ensure uniqueness automatically, since each allocated object has a distinct address. This approach has three significant drawbacks. First, all TU summaries must be loaded simultaneously during linking to guarantee uniqueness. Second, when a translation unit summary is deserialized, entity objects are allocated at addresses that differ from their serialization-time addresses, requiring identifier updates throughout the summary data. Third, entity identifiers that change between runs based on allocator behavior complicate testing and debugging.
2. Implementing Entity Resolution in Client Analysis
The Entity Linker could instead compute only entity equivalence classes, which are sets of cross-TU entity identifiers representing the same global entity. The linker output would include these equivalence classes along with the input TU data. Client analyses would traverse equivalence classes, pick canonical representatives, merge data from equivalent entities, and update references. This approach minimizes linker work, potentially speeding up builds, and gives analyses full control over deduplication strategy. However, this approach fails to provide a clean whole-program abstraction, requiring client analyses to handle the multi-TU entity linking complexity. If the linker already computes equivalence classes, completing deduplication and merging is a natural extension that significantly simplifies analysis implementation. Our framework’s value proposition is precisely to factor out these common concerns; making entity deduplication the analysis’s responsibility undermines this goal.
3. Implementing Entity Linking in Binary Linkers
Given the alignment with binary linkers, we considered extending them to perform entity linking rather than designing a new tool. This would provide single-tool invocation producing both the binary executable and the linked summary, and ensure synchronization between binary and summary data. However, binary linkers work with machine code representations. Embedding analysis data in object files would require either extending existing object file formats or defining new sections, both of which have ecosystem implications. The patching logic would need to be embedded in the linker, creating tight coupling between serialization formats and linker implementations. A standalone Entity Linker can be developed and evolved independently while integrating naturally with the build pipeline.
4. Patching Implementations
To maintain analysis agnosticism, we delegate patching to serialization format implementations. Alternative approaches include implementing a concrete patching algorithm in the linker itself:
- Entity Markers - Embed special marker bytes in the serialized format before each entity identifier. The linker scans the data stream for these markers, interprets the following bytes as entity identifiers, and replaces them with new values.
- Fixup Tables - Require every serialization format to compute a fixup table during summary extraction that provides entity identifier locations in the serialized data. The linker uses this table to locate identifiers for replacement.
Both approaches increase summary-data size and require separate implementations for textual and binary formats, since integer identifiers in text-based formats have variable length. All serialization format implementations would inherit these drawbacks. The proposed approach avoids these problems by letting implementers choose appropriate encodings for analysis data, at the cost of implementing corresponding patching functions. This is a one-time implementation cost, and patching algorithms can leverage encoding details for optimizations.
An alternative design that avoids patching would have the Entity Linker perform resolution and emit local-to-global entity identifier mapping tables in the linked output for each TU. Client analyses would use these tables to update TU-local identifiers in summary data. This approach works with any serialization format and requires no changes to summary data itself. However, beyond increasing output size, this approach breaks the whole-program abstraction and requires client analyses to handle multi-TU entity management complexity.
Future Implementation Work
Future work could extend this design with the following capabilities:
Multi-Library Builds - The current implementation produces a single executable LU from the inputs. Future work could extend this to support multi-library applications composed from static archives and shared libraries. To support static archives, the Entity Linker would package TU summaries in the LU without merging; in the final link step, only summaries for referenced entities would be included. To support shared libraries, the Entity Linker would need to model which entities are exported from each library and how dynamic linking resolves references across library boundaries. This requires adding visibility metadata (hidden, default, and protected) to entity linkage. For plugin systems that use weak symbol attributes, the linkage metadata would be extended to record weak and strong annotations, and the Entity Linker would model their semantics.
ODR Violations - ODR violations could be detected and reported in debug mode. The Entity Linker could show source locations for entity definitions with the same identifier but incompatible summaries. This would require either implementing isomorphic equivalence checking for summary data or comparing ODRHash of program entities, a hash value computed from AST nodes designed to detect ODR violations in C++ modules.
Unused Entity Elimination - Unused entity elimination would remove unreferenced entities from the linked summary, similar to how binary linking removes unreferenced code sections. For projects that transitively include large headers, translation units can contain tens of thousands of entity references, most never used. Eliminating these could significantly reduce summary size and improve analysis performance. Implementing this optimization requires a list of entities referenced by every entity in the summary data to build a reachability graph and remove unreferenced entities.
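Assuming such per-entity reference lists were available, the reachability pass itself would be a standard worklist traversal; a sketch with illustrative types:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Mark every entity reachable from the given roots (e.g. main and other
// exported symbols) by following per-entity reference lists; entities
// outside the returned set could be dropped from the linked summary.
std::set<unsigned>
reachableEntities(const std::map<unsigned, std::vector<unsigned>> &references,
                  const std::vector<unsigned> &roots) {
  std::set<unsigned> live;
  std::vector<unsigned> worklist(roots);
  while (!worklist.empty()) {
    unsigned id = worklist.back();
    worklist.pop_back();
    if (!live.insert(id).second)
      continue; // already visited
    if (auto it = references.find(id); it != references.end())
      worklist.insert(worklist.end(), it->second.begin(), it->second.end());
  }
  return live;
}
```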
Parallel Linking - Linking time can become significant for large projects with millions of entities. Parallel linking could process multiple TU summaries concurrently. The resolution and merging phases would require appropriate synchronization, but patching can run in parallel without conflicts.
Incremental Linking - Incremental linking would allow reusing portions of a previous link when only some input TU summaries have changed. This requires maintaining stable entity identifiers across builds, determining which parts of the previous link can be reused, extending the Entity Linker to accept the previous link output (LU) as input, and supporting efficient incremental updates in the serialization format.
Optimized Binary Serialization Format - This proposal describes patching support for TU summaries in the JSON format due to its simplicity and ease of testing. Future work could add an optimized binary serialization format for faster linking.
Custom Serialization Formats - The SSAF framework leverages LLVM’s Registry to support registering and discovering pluggable serialization formats. The current implementation only supports formats provided by the framework since they are registered at compile time. Supporting dynamically registered formats would require implementing a command-line flag to specify a shared library path that will be loaded to update the serialization format registry, allowing the Entity Linker to discover the format.