Overview
For profile-guided call graph sort, graph edges are constructed by the call graph profile pass and used by the linker to sort the call graph. We propose constructing more call-graph edges, for a more complete call graph, by also considering edges from 1) a caller to its long-tail indirect callees (elaborated below) and 2) a caller to a large callee (in terms of function size) that’s not imported. Our experiments show this improves performance by ~0.2% on internal Search workloads.
Background
The linker places functions according to call edge weight so that a function and its frequent callees are placed close together for better TLB efficiency (see the RFC “[llvm-dev] [RFC] Profile guided section layout”).
For direct calls, the edge weight is the FDO block count frequency of the callsite, which is accurate. For indirect calls, function target value profiling is used for weight calculation. Currently, given a callsite with caller foo and callee bar, the edge <foo, bar> is not accounted for if any of the following conditions is met:
- Target value not annotated on the edge.
  - This happens when bar is not one of the top 3 indirect-call-promotion targets.
- Callee symbol not seen.
  - CGProfile is a compiler pass and it constructs the symbol table from the module IR. An edge might be missed when the caller and callee are defined in two different modules and the callee is either not imported or prematurely GC’ed by global-opt [1] or global-dce. In that case the symbol table cannot see the function symbol and the pass cannot construct the edge.
As a result, the function bar could appear to have zero incoming calls and be placed towards the end of the hot section, even if it’s a hot function with a large weight from foo.
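The two conditions above can be illustrated with a small model. This is only a sketch of the behaviour described in this post, not the actual CGProfile implementation; `build_edges`, `TOP_N_ANNOTATED`, and the sample profile are hypothetical names and numbers chosen for illustration.

```python
from collections import defaultdict

# Assumption: mirrors the "top 3" annotated targets mentioned above.
TOP_N_ANNOTATED = 3


def build_edges(caller, direct_calls, indirect_profile, visible_symbols,
                top_n=TOP_N_ANNOTATED):
    """Model which <caller, callee> edges survive today.

    direct_calls:     list of (callee, block_count) pairs
    indirect_profile: list of (target, count) pairs from value profiling
    visible_symbols:  function symbols present in the caller's module IR
    """
    edges = defaultdict(int)

    # Direct calls always reference a symbol in the module, so they are kept.
    for callee, count in direct_calls:
        edges[(caller, callee)] += count

    # Condition 1: only the top_n hottest indirect targets are annotated.
    annotated = sorted(indirect_profile, key=lambda t: t[1], reverse=True)[:top_n]

    # Condition 2: an annotated target is dropped if its symbol is not visible.
    for callee, count in annotated:
        if callee in visible_symbols:
            edges[(caller, callee)] += count

    return dict(edges)


profile = [("a", 500), ("big", 400), ("c", 150), ("bar", 50)]
edges = build_edges("foo", [], profile, visible_symbols={"foo", "a", "c", "bar"})
# "big" is annotated but its symbol is missing; "bar" is visible but not annotated.
```

In this toy input both `<foo, big>` and `<foo, bar>` are lost, each for a different reason, which is why the two fixes below are complementary.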
How to get these missed edges back
Value Profile Annotation
For instrumented FDO, we could annotate all indirect call target values. Similarly, for AutoFDO, we could annotate all branch samples for an indirect call. Since indirect-call-promotion could still look at only the top 3 hottest targets, the additional annotations shouldn’t cause regressions.
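A minimal sketch of why extra annotations need not change promotion decisions, assuming promotion simply takes the hottest few annotated targets. `annotate_targets` and `icp_candidates` are hypothetical helpers, not LLVM APIs.

```python
def annotate_targets(value_profile, max_targets=None):
    """Return (target, count) pairs sorted hottest first.

    max_targets=None models the proposal (annotate everything);
    max_targets=3 models the current top-3 annotation.
    """
    ranked = sorted(value_profile.items(), key=lambda kv: kv[1], reverse=True)
    return ranked if max_targets is None else ranked[:max_targets]


def icp_candidates(annotated, limit=3):
    """Promotion only looks at the first `limit` annotated targets."""
    return [name for name, _ in annotated[:limit]]


value_profile = {"a": 500, "b": 300, "c": 150, "bar": 50}
current = annotate_targets(value_profile, max_targets=3)
proposed = annotate_targets(value_profile)
```

With this input, `icp_candidates` returns the same three targets for `current` and `proposed`, while only `proposed` still carries the long-tail target `bar` for edge construction.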
Support cross module function declaration import
This is the heavy-lifting step of this effort. Currently, function definitions are imported based on hotness and size: at a very high level, hotter functions have a larger size threshold than colder functions, and the size threshold exists mainly to limit compile time without harming performance.
We propose supporting the import of function declarations, so that function symbols are present for cross-module edge construction when the function definition is not imported for any other reason (e.g. the function is too large to be inlined). Compared with using a larger threshold to import function definitions, this could save a lot of compile time, and it keeps the current postlink internalization as it is. Another potential use case of importing function declarations is speculative indirect-call-promotion on function declarations. Currently only function definitions are ICP’ed.
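The proposed import decision could be sketched as follows. The threshold values, the `ImportKind` enum, and `decide_import` are all hypothetical; the real importer uses more inputs than size and hotness, so this only illustrates the fallback from definition import to declaration import.

```python
from enum import Enum


class ImportKind(Enum):
    DEFINITION = "definition"
    DECLARATION = "declaration"
    NONE = "none"


# Hypothetical size thresholds: hotter callees get a larger budget.
SIZE_THRESHOLD = {"hot": 300, "normal": 100, "cold": 0}


def decide_import(size, hotness, is_profiled_call_target):
    """Pick what to import for a callee defined in another module."""
    if size <= SIZE_THRESHOLD[hotness]:
        # Small enough for its hotness: import the body as today.
        return ImportKind.DEFINITION
    if is_profiled_call_target:
        # Too large to import the body, but the symbol is still needed
        # so that the <caller, callee> edge can be constructed.
        return ImportKind.DECLARATION
    return ImportKind.NONE
```

For example, `decide_import(1000, "hot", True)` yields a declaration import under these assumed thresholds, where today nothing would be imported.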
Preserving function declaration symbols for edge construction
Currently, the global-opt and global-dce passes might clean up function declarations if they appear unused. These two passes run before the call-graph profile pass, which constructs the edges.
To preserve function declaration symbols while still getting the necessary clean-ups from global-opt and global-dce, we may need to run both global-opt and global-dce twice: once before the call-graph profile pass, preserving function declarations if they are indirect call targets, and once after the call-graph profile pass, to delete function declarations if necessary.
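The ordering described above can be modelled with plain sets. This is a toy model of the pass ordering only; `cleanup_declarations` and `run_pipeline` are hypothetical stand-ins and do not correspond to the real global-opt, global-dce, or CGProfile code.

```python
def cleanup_declarations(decls, used, keep=frozenset()):
    """Drop declarations that appear unused, except those listed in `keep`."""
    return {d for d in decls if d in used or d in keep}


def run_pipeline(decls, used, indirect_targets, profile):
    """profile maps (caller, callee) to a weight."""
    # First clean-up: preserve declarations that are indirect call targets.
    decls = cleanup_declarations(decls, used, keep=indirect_targets)

    # Edge construction: an edge is only built if the callee symbol is visible.
    edges = {(caller, callee): weight
             for (caller, callee), weight in profile.items()
             if callee in decls}

    # Second clean-up: the preserved declarations can now be deleted.
    decls = cleanup_declarations(decls, used)
    return edges, decls


edges, remaining = run_pipeline(
    decls={"bar", "baz", "qux"},
    used={"baz"},
    indirect_targets={"bar"},
    profile={("foo", "bar"): 50, ("foo", "baz"): 10},
)
```

Here the edge `<foo, bar>` is constructed even though `bar` appears unused, and the final module still ends up with only `baz` declared.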
Prototyping Results
We prototyped this approach by constructing the more complete set of call graph edges. It shows a ~0.2% QPS improvement on one internal search workload on both x86 and Arm. The prototype doesn’t implement function declaration import yet; instead, it prints semi-structured logs in the first FDO build and consumes those logs in a repeated FDO build, as a quick validation of the idea.
[1] The global-opt pass could clean up externally available functions, which are one type of discardableIfUnused function.