[RFC] For better call-graph sort, build a more complete call-graph by adding more indirect call edges

Overview

For profile-guided call graph sort, graph edges are constructed by call graph profile pass and used by linker to sort call graphs. We propose to construct more call-graph edges by considering edges from 1) caller to long tail indirect callees (elaborated low) 2) caller to a large callee (in terms of function size) that’s not imported for a more complete call graph. Our experiments show it improves performance by ~+0.2% on internal Search workloads.

Background

Compiler places functions according to the call edge weight so that functions and its frequent callees are placed closer for better TLB efficiency (see RFC in [llvm-dev] [RFC] Profile guided section layout).

For direct calls, the edge weight is the FDO block count frequency of the callsite and accurate. For indirect calls, function target value profiling is used for weight calculation. Currently, given a callsite with caller foo and callee bar, the edge <foo, bar> is not accounted for if any of the following condition is met

  • Target value not annotated on the edge
    • This happens when bar is not one of top 3 indirect-call-promotion targets.
  • Callee symbol not seen.
    • CGProfile is a compiler pass and it constructs the symbol table from module IR. An edge might be missed when caller and callee are defined in two different modules, and callee is either not imported, or prematurely GC’ed by global-opt [1] or global-dce, which means the symbol table cannot see the function symbol and cannot construct the edges

As a result, the function bar could appear to have zero incoming calls and be placed towards the end of the hot section, even if it’s a hot function with a large weight from foo.

How to get these missed edges back

Value Profile Annotation

For instrumented FDO, we could annotate all indirect call target values. Similarly for AutoFDO, we could annotate all branch samples for an indirect call. Since indirect-call-promotion could look at the top 3 hottest targets, more annotations won’t cause regressions.

Support cross module function declaration import

This is a heavy-lifting step of this effort. Currently, function definitions are imported based on hotness and size; at a very high level, hotter functions have larger size threshold than colder functions; and size threshold exists mainly for compile time considerations without harming performance.

We propose supporting importing of function declarations to make sure function symbols are present for cross-module edge construction if the function definition is not imported for any other reason (e.g. a function is too large to be inlined). Compared with using a larger threshold to import function definitions, this could save a lot of compile time, and keeps the current postlink internalization as it is. Another potential use case of importing function declarations is to do speculative indirect-call-promotion on function declarations. Currently only function definitions are ICP’ed.

Preserving function declaration symbols for edge construction

Currently, global-opt and global-dce pass might clean up function declarations if they appear unused. And these two passes run before the call-graph profile pass which construct edges.

To preserve function declaration symbols but still having the necessary clean-ups from global-opt and global-dce, we may need to run both global-opt and global-dce twice; once before call-graph profile pass that preserves function declarations if they are indirect call targets, and once after call-graph profile pass to delete function declarations if necessary.

Prototyping Results

We prototyped this approach by having the more complete call graph edges. It shows a ~+0.2% QPS improvement on one internal search workload on both x86 and arm. The prototype doesn’t implement function declarations yet; instead, it prints semi-structured logs in the first FDO build, and consumes the log by repeating the FDO build for a quick validation of the idea.

[1] global-opt pass could clean up externally available functions, which is one type of discardableIfUnused functions.

@davidxl @teresajohnson @rlavaee who has seen a draft of this.

Can the threshold be a hotness cutoff rather than a constant (like 3) cutoff?

Can the threshold be a hotness cutoff rather than a constant (like 3) cutoff?

Thanks for taking a read! Would you elaborate a little bit on the scenarios where a hotness cutoff is preferred? From graph construction perspective, lukewarm edges (i.e., the edge count is not hot according to FDO hotness cutoff) makes much smaller difference in the order of functions, but overall annotating more edges from indirect call value profiles gives better parity between indirect calls and direct calls.

IIUC (and correct me if I’m wrong), the SampleFDO/instrumented FDO you pointed at only annotates the hottest 3 target functions per indirect call. You suggested annotating all target functions, but can we instead only annotate hot targets based on some hotness threshold? e.g. for cold indirect calls, there’s no need to annotate any target functions since it doesn’t matter, but for a hot indirect call with 5 hot or lukewarm targets and 5 cold targets, we should annotate all 5 hot/lukewarm targets and none of the cold targets?

Discarding the cold edges likely doesn’t affect the outcome of function layout, since functions are placed in a straight line. But if annotating more edges doesn’t have unintended side-effects (e.g., binary size, etc), the sorting algorithm could decide how to use the edge. For instance, the hfsort discard edges cross ELF sections

If there are no downsides (e.g. compile time or the other things you mentioned), then sgtm.

I’ve always wondered if importing function definitions with attributes inferred pre-link would be useful for optimizing based on call attributes (accounting for GlobalValue::mayBeDerefined(), which internalization may help with e.g. if we can change linkonce_odr to external linkage?). Would this help with that?

To log our discussions as a fyi for others who might read this in the future,

The following example from @aeubanks is another potential use case of symbol declaration import

  1. Compiler Explorer shows a function call without side-effects could be simplified away if a result is not used.

  2. With cross-module references, the function is still there, even if it could be simplified away if a symbol with attributes are in the caller module.

cat /tmp/y.c
void f();
int main() {
    f();
}
$ cat /tmp/z.c
void f() {}
$ clang -O1 -flto=thin -Wl,-mllvm,-import-instr-limit=0 /tmp/y.c /tmp/z.c -fuse-ld=lld -Wl,--threads=1 -Wl,-mllvm,-print-changed

Re “accounting for GlobalValue::mayBeDerefined() , which internalization may help with e.g. if we can change linkonce_odr to external linkage?”, importing declarations with attribute doesn’t help with internalization of linkonce_odr functions as we discussed.

On a second thought of this, ThinLTO imports non-cold functions with an instr limit of 100 by default, and cold functions with an instr limit of 0. Some functions with no side-effects might be small enough to make it through the current importing threshold currently.