“Contextual Profiling” keeps separate counter values based on the call path to a function. For event-driven software, like server-side services (which react to requests), the paths interesting for performance start at the point where worker threads call request handlers to process specific requests or continuations. Identifying these starts (“roots”) is out of scope; it can be done a number of ways - manually, via sample-based profiling, or using knowledge of the request handling framework (“all implementers of these virtual methods”).
We plan to upstream an instrumentation[1] solution, which we validated on the search binary. At its core, it is a relatively small change (a few hundred lines of code in LLVM and compiler-rt, respectively), has competitive instrumented-binary runtime characteristics, and produces manageable-sized profiles. A “demo”, or overview, PR is available here (meaning a PR that serves to illustrate the overall implementation but will be carved out into a succession of smaller changes). A high-level overview and the results obtained from applying it to the search binary were presented at EuroLLVM’24 - the slides are available here.
This RFC discusses details not covered in the high level overview linked above, including the profile use plan.
Scope
The current implementation is scoped to Linux x86. We plan to extend it to Arm. There aren’t any known x86-vs-Arm challenges, but we haven’t yet done the work to validate that.
Instrumentation
LLVM
The LLVM changes are fairly straightforward: they consist of a new instrumentation intrinsic, llvm.instrprof.callsite, and a new lowering pass, PGOCtxProfLowering. Opting in to contextual instrumentation is done via the LLVM flag -profile-context-root, which specifies the name of a function that constitutes an interesting entry point - once entered, we collect contextually for all the calls under it. The flag is a list, so it can be passed multiple times to specify multiple entry points.
If opted in, the following compile-time changes happen:
- PGOInstrumentation
- will instrument the entry BB. This is because we use the entry BB to capture how often a function is called (in whichever context), which in turn determines the frequency of indirect function calls.
- will disable value profiling. This will be relaxed in later work: we don’t need value profiling for indirect calls, because we achieve that implicitly, but some other kinds of value profiling could be interesting. Also, the first immediate (planned) use case is MLGO reward formulation for IPO, which can be achieved without the non-indirect-call side of value profiling. So the reasoning here is simply keeping the first batch of changes scoped.
- will insert the llvm.instrprof.callsite intrinsic above any “real” callsite (i.e. not inline asm and not intrinsics; the latter means that, today, we lose mem* calls, like memcpy, but that’s addressable). A sketch of the insertion is shown after this list.
- PassManager
- use PGOCtxProfLowering instead of InstrProfilingLoweringPass
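For illustration, a minimal sketch of the callsite instrumentation insertion referenced above might look like the following. The intrinsic ID and operand list here are assumptions made for this sketch (mirroring llvm.instrprof.increment); the actual shape is in the demo PR.

```cpp
// Sketch only: insert a marker intrinsic before each "real" call in a
// function, numbering callsites sequentially. The intrinsic ID
// (Intrinsic::instrprof_callsite) and its operands are assumptions for
// illustration, mirroring the existing llvm.instrprof.increment.
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"
#include <cstdint>

using namespace llvm;

static void instrumentCallsites(Function &F, GlobalVariable *FuncNameVar,
                                uint64_t FuncHash) {
  uint32_t CallsiteIndex = 0;
  for (BasicBlock &BB : F)
    for (Instruction &I : BB) {
      auto *CB = dyn_cast<CallBase>(&I);
      // Only "real" callsites: skip non-calls, inline asm, and intrinsics.
      if (!CB || CB->isInlineAsm() || isa<IntrinsicInst>(CB))
        continue;
      IRBuilder<> Builder(CB);
      // Hypothetical marker: @llvm.instrprof.callsite(name, hash, index).
      Builder.CreateIntrinsic(Intrinsic::instrprof_callsite, {},
                              {FuncNameVar, Builder.getInt64(FuncHash),
                               Builder.getInt32(CallsiteIndex++)});
    }
}
```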
compiler-rt
Using the demo PR as reference, we added compiler-rt/lib/profile/InstrProfilingContextual.{h|cpp}. Technically this is stand-alone functionality, independent of the functionality in lib/profile and depending only on sanitizer_common functionality. The dependency on lib/profile is debatable: at most, we’d depend on the destination file path logic. It is debatable because, for binaries that finish their main as part of expected operation - like clang - having the profile be auto-generated and saved at process exit is quite “natural”. The opposite is true for the server side, where main doesn’t meaningfully exit and we intentionally have a dedicated profile collection service (RPC endpoint), which can just pass the output profile path. However, we have not explored other event-driven scenarios, like client applications (browsers, phone apps); there, the equivalent of a profile collection handler may also make sense.
As it stands, we plan to introduce InstrProfilingContextual in its own peer directory to lib/profile and rely on a separate profile collection mechanism using explicit runtime APIs (instead of overloading __llvm_profile_reset and __llvm_profile_dump). A few reasons for that:
- we avoid, for the server-side scenario, unnecessary legacy from the non-server-style profile collection
- we can avoid passing the output path through the compiler and instead provide a callback. This works better for the server side, where “local disks” may not be present and a “buffer writer callback” makes more sense.
- we can expand the API set a bit to provide some telemetry as to whether the server has “warmed up” (i.e. has the rate of new context discovery dropped under a certain threshold). This can provide feedback to the profile collection service.
To be clear, there’s no reason not to eventually hook up the new functionality as an opt-in extension of lib/profile, but as a scoping exercise, we don’t plan to do that upfront.
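To make the above concrete, a hypothetical sketch of what such an explicit runtime API surface could look like is below. All names and signatures are placeholders for illustration, not the actual interface from the demo PR.

```cpp
// Hypothetical API sketch (names and signatures are placeholders, not the
// actual compiler-rt interface). A profile collection service (e.g. an RPC
// handler) could drive collection explicitly, instead of relying on the
// atexit-style flow in lib/profile.
#include <cstddef>

extern "C" {

// Caller-supplied sink for the serialized profile, e.g. a network buffer
// writer; avoids assuming a writable local disk.
typedef void (*ProfileWriterCallback)(const char *Data, size_t Size,
                                      void *UserData);

// Start / stop collection for the contexts rooted at the opted-in entry points.
void __llvm_ctx_profile_start_collection();
void __llvm_ctx_profile_stop_collection();

// Serialize the current context trees through the provided callback.
int __llvm_ctx_profile_dump(ProfileWriterCallback Writer, void *UserData);

// Telemetry: rate of new-context discovery over the last observation window,
// usable by the collection service to decide whether the server has warmed up.
double __llvm_ctx_profile_new_context_rate();

} // extern "C"
```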
Profile conversion tool: llvm-ifdo-ctx
We use bitstream for the profile format because we want a semi-structured, nested, compact file format. However, taking a dependency from compiler-rt on the bitstream writer is challenging. Instead, we dump the raw runtime memory buffers (arenas) we use for profile collection, with a bit of extra metadata that allows us to follow pointers correctly, and then post-process the result with a tool that produces the bitstream. The assumption is that the post-processing happens on a machine with the same endianness as the runtime that generated the profile - we believe this is a reasonable assumption, and it greatly simplifies the compiler-rt side of things.
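As an illustration of the “follow pointers correctly” part: if each dumped arena records the address it occupied at runtime, the tool can translate a pointer captured in the profile into an offset into the dumped bytes. A minimal sketch of that idea, with hypothetical types, follows.

```cpp
// Sketch of translating runtime pointers found inside dumped arenas into
// offsets into the dump. Types and field names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct DumpedArena {
  uint64_t RuntimeStart;       // address the arena occupied in the profiled process
  std::vector<uint8_t> Bytes;  // raw arena contents as written by the runtime
};

// Resolve a pointer captured at runtime (e.g. a child-context pointer) to the
// arena and offset holding the pointed-to object in the dump.
std::optional<std::pair<const DumpedArena *, size_t>>
resolve(const std::vector<DumpedArena> &Arenas, uint64_t RuntimePtr) {
  for (const auto &A : Arenas)
    if (RuntimePtr >= A.RuntimeStart &&
        RuntimePtr < A.RuntimeStart + A.Bytes.size())
      return std::make_pair(&A, size_t(RuntimePtr - A.RuntimeStart));
  return std::nullopt; // scratch or dangling pointer - not in any dumped arena
}
```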
The tool is added as a peer to the other profile tools, under llvm/tools. It currently uses “whatever made it work” values for the bitstream magic and subblock IDs; we’re looking for guidance on how to choose these values.
Note: a reasonable question would be “why not extend llvm-profdata?”. The reason is that the implementation is very different.
An Example
Refer to the compiler-rt/test/profile/Linux/contextual.cpp test in the demo PR. Note how we compile the .cpp file (with instrumentation), specifying that the_root is an entry point. When we run the instrumented binary, all functions called during startup, including main, call __llvm_instrprof_get_context, which notices there is no context and returns the scratch context, meaning their counter updates are done in the same buffer - effectively discardable. When main calls the_root, the latter will call __llvm_instrprof_start_context, which will set up the compiler-provided ContextRoot and install it in the TLS variable __llvm_instrprof_current_context_root. The implementation of printf and someFunction will observe it, and the implementation of __llvm_instrprof_get_context will advance to the point where it realizes it needs to bump-allocate a context (by calling __llvm_instrprof_slow_get_callsite). That happens in each case - even for the second call to someFunction - because the call is at a different callsite.
If we called the_root again from main, __llvm_instrprof_start_context would detect that it has already set up the needed arenas and the first context, and would exit quickly, returning the root context. Each of the calls to __llvm_instrprof_get_context made from under the_root would then return the already-allocated context promptly (they’d enter this loop exactly once, coming out with a valid Callsite).
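For readers without the PR handy, the following is a minimal sketch of the shape of such a test program (not the actual test source), with the_root designated as the entry point via -profile-context-root:

```cpp
// Minimal sketch of the shape of the test scenario (not the actual
// contextual.cpp test from the PR), compiled with contextual instrumentation
// and the_root designated as an entry point.
#include <cstdio>

void someFunction(int I) { printf("check odd/even %d\n", I % 2); }

// Designated entry point ("root"): everything called from here is profiled
// contextually; the compiler emits a __llvm_instrprof_start_context call here.
__attribute__((noinline)) void the_root() {
  printf("in the root\n");
  someFunction(1);  // first callsite to someFunction -> one child context
  someFunction(2);  // distinct callsite -> a separate child context
}

int main() {
  // Calls made before reaching the_root (including this printf) update the
  // scratch context and are effectively discarded.
  printf("startup\n");
  the_root();
  the_root();  // contexts already exist; only their counters get bumped
  return 0;
}
```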
Measurements
See the EuroLLVM’24 presentation. In particular, on the server side, the technique results in instrumented binaries that are several times (4-5x) faster than current iFDO across all latency-related metrics, because it avoids shared writes. This is also the reason sampling iFDO achieves better performance; we did not compare against it because it hasn’t landed yet. We expect it to have a performance edge over contextual instrumentation - simply because it does less. The plan is to eventually merge the two, implementation-wise.
Binary size is relatively unchanged: we do not have pre-allocated counters, but we do bloat .text (currently by ~43%), and that bloat pretty much cancels out the savings from not having pre-allocated counters. The .text bloat is likely to be ameliorated in the future, but unlikely to be completely eliminated.
Runtime memory overhead is relatively low, and mostly dominated by the thread-local scratch context buffers, which we currently allocate statically (and pessimistically) at 1MB.
Profile use
Not shown in the demo PR, we plan to add:
- a context profile reader - a reader of the bitstream format that reconstructs the analogous tree structure. The latter is very similar in spirit to what CSSPGO needs to maintain, but the differences - for example, counters vs. samples - make reuse here appear impractical.
- an immutable analysis maintaining the contextual profile and exposing APIs for reacting (updating) to indirect call promotion and inlining (see the sketch after this list). We prefer this particular separation of concerns over an alternative where passes would carry out the update explicitly themselves, in the hope that we could eventually share the same interface with the sample-based profile, which internally may look different.
- currently, we assume thinlto. There is no reason not to: server-side performance optimization without it makes little sense, and this assumption simplifies the rest of the work.
- a new flag indicating that the profile being ingested is contextual. The changes below are controlled by this flag.
- The pass manager will stop compiling right after the instrumentation pass in the pre-thin-link compile. This is to avoid the need to update the context profile before the thin link, and to not break our ability to match counters in the thinlto backends.
- the thinlto link will also ingest the contextual profile and treat it as a workload definition (PR #74545), importing all the functions under a context root into the module defining that root. Note that the PR currently uses function names, not GUIDs - that will be extended accordingly.
- the post-thinlto compile will first perform the IPO transformations, and only then function simplification and optimization. We leverage the ModuleInliner for that. We are not very concerned about such a change causing regressions: we validated it is a performance no-op for, e.g., the search binary. We’ll investigate regressions, should they appear.
- the rationale for this is that maintaining the contextual profile through a mix of IPO and function simplification is quite tricky. Instead - akin to how CSSPGO happens in the SampleLoader - we front-load IPO and make the IPO transformations aware of the contextual profile, which they’ll maintain. Progressively, we can bring more passes into this “contextual awareness”, as necessary. As we want to do that through an abstraction offered by the immutable analysis, CSSPGO could transparently leverage this possibility, too.
- post-IPO, a flattener. As there won’t be any more IPO, there’s no further value in contextualization. We flatten the profile back to current MD_prof. This will also require a small refactoring in PGOInstrumentation to reuse the current counter → MD_prof lowering.
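To sketch the separation of concerns proposed above for the immutable analysis, its interface could look roughly as follows. Names, types, and signatures are illustrative placeholders (declarations only), not the proposed API.

```cpp
// Illustrative-only sketch of an analysis result that owns the contextual
// profile tree and lets IPO passes report transformations, so the profile is
// updated in one place. Names and signatures are placeholders.
#include <cstdint>
#include <vector>

namespace llvm {
class Function;
class CallBase;

// One node per (call path, function) pair: counters plus child contexts,
// keyed by callsite index (ownership elided in this sketch).
struct ProfileContextNode {
  uint64_t FunctionGUID = 0;
  std::vector<uint64_t> Counters;
  // Each callsite may have several callees (e.g. different indirect targets).
  std::vector<std::vector<ProfileContextNode *>> Callsites;
};

class ContextualProfileAnalysisResult {
public:
  // IPO passes notify the analysis instead of rewriting the profile themselves.
  // Inlining CB merges the callee's subtree into the caller's node.
  void handleInlining(const Function &Caller, const CallBase &CB,
                      const Function &Callee);
  // Indirect call promotion splits the callsite's children between the
  // promoted direct call and the remaining indirect call.
  void handleICP(const CallBase &IndirectCB, const Function &PromotedCallee);
  // Used by the post-IPO flattener: sum counters over all contexts of F.
  std::vector<uint64_t> flatten(const Function &F) const;
};
} // namespace llvm
```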
This plan just lays the foundation for using the contextual instrumented profile. An even later step will start exploring the use of such a profile for better inliner (i.e. [Module]Inliner.cpp) decisions, and explore other IPO that contextualization empowers, like function outlining, function splitting, etc.
Alternative Instrumentation-Based Approaches and Trade-offs
CSFDO
CSFDO consists of 2 instrumentation phases: the first one is normal iFDO, happening before the ThinLTO link. After collecting that profile, we don’t build an optimized binary; instead, we use it to build another instrumented binary, where the second instrumentation happens post-thin-link, and after post-thin-link inlining. All prior optimizations and the ThinLTO link are driven by the first profile.
Compared to CSFDO, the contextual iFDO technique requires only one instrumentation run and exposes profile contextualization during inlining and other IPO, opening the door for better IPO decisions.
Counter pre-allocation
A continuation of the current iFDO, this would require pre-allocating per-path counters. Naively, this would result in a prohibitively large number of counters, but a more nuanced approach is possible if it were performed during thinlto and counters were assigned to paths rather than edges. For the server side, it would still require some server-oriented intelligence in identifying where the paths start - which could be fixed at pthread_start. The approach would be quite complex, though - non-trivial thinlto changes, and still a potentially large counter pre-allocation. We believe that “entrypoint awareness” is a concept worth introducing to compiling server-side (and, more generally, event-driven) software, outside of the profiling scenario, and observe that the resulting design simplification is quite significant for the profiling case.
Stack walking
Contexts could be determined by walking up the stack, but:
- that’s expensive (O(depth of stack) for each call site rather than O(1) in our approach)
- it complicates IPO in the instrumented binary - for instance, we would need to expand stack frames with inlining info - which is not a limitation of our approach.
- it leaves open when counters get allocated. If that happens ahead of time, we hit the same problem as above - an explosion of counters. If it happens at runtime, then arguably the described solution is simpler.
Tracing (xray)
We could use an evolution of xray, whereby we trace events like function calls and then reconstitute the call graph afterwards. This was actually an earlier idea we explored but, after prototyping (and an opportunistic idea exchange with rnk@ - thanks!), decided to abandon in favor of maintaining the context tree, as described here. The main liability of tracing would be the size of the trace and the need to periodically flush it, the latter likely having a comparable, if not higher, cost to our current runtime overhead. In contrast, our approach achieves “infinite” compression at steady state, i.e. as long as no new paths are discovered, the memory overhead remains unchanged (because only the counters get updated).
Future work
Collect flat profiles, too
The idea (also pointed out by others, e.g. minglotus-6) is “why not use those scratch buffers for something useful”. One immediate way is to combine with the current iFDO and return those pre-allocated buffers instead (note that the implementation identifies “scratch”-ness by tainting the pointer, i.e. any memory buffer(s) could play the “scratch” role - see the sketch below). Another possibility is to dynamically allocate that memory instead of pre-allocating it, and only do so after profiling starts - because most server-side code is cold or uninteresting (like startup code). We plan to explore this after the initial patches land.
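As an illustration of the pointer-tainting idea (the exact encoding used by the runtime may differ), a spare low bit of the context pointer can mark “scratch”-ness, so the check is a mask and any sufficiently aligned buffer can play the scratch role:

```cpp
// Sketch of the "taint the pointer to mark scratch-ness" idea. The actual bit
// and encoding in the implementation may differ; this only illustrates that
// any aligned buffer can serve as the scratch context.
#include <cstdint>

// Contexts are allocated with (at least) 2-byte alignment, so bit 0 is free
// to use as the "scratch" marker.
constexpr uintptr_t ScratchBit = 1;

inline void *markAsScratch(void *Buffer) {
  return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(Buffer) | ScratchBit);
}

inline bool isScratch(const void *Ctx) {
  return reinterpret_cast<uintptr_t>(Ctx) & ScratchBit;
}

inline void *realPointer(void *Ctx) {
  return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(Ctx) & ~ScratchBit);
}
```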
Optimize the .text size
Right now the .text size bloats up to pretty much cancel out the size savings from there not being any pre-allocated counters. The main reason is the calls to the runtime and the callsite instrumentation. We can probably reduce that; the current implementation made no attempt at doing so. We could also look more critically at the preinliner and potentially make it more aggressive. Finally, we could incorporate ideas such as path profiling <link>, which should reduce the number of counter increments.
Automatic entrypoint detection
We could consider discovering (approximating) entry points by first observing, for a short period during profile collection, what threads do and finding “watermark(s)” - functions under which most work happens. We might explore that later; however, we believe there is value in intentionally expressing what the important workloads are.
Other kinds of counters
Note that the design really just requires that the Context object have a header and a list of children (the callee contexts). The counter payload can be further differentiated to support, for example, value profiling.
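For illustration, the minimal shape such a Context object implies might look like the following; field names are placeholders, and the actual layout is in the demo PR.

```cpp
// Placeholder sketch of the minimal Context layout the design implies: a small
// header, a counter payload, and per-callsite lists of callee contexts. Field
// names are illustrative only; the actual layout is in the demo PR.
#include <cstdint>

struct Context {
  // Header: identifies the function and sizes the trailing payloads.
  uint64_t FunctionGUID;
  uint32_t NumCounters;
  uint32_t NumCallsites;
  // Next callee context reached from the same callsite (e.g. a different
  // indirect-call target).
  Context *NextSibling;
  // Trailing, bump-allocated payloads (shown as comments, not expressible as
  // fixed-size fields):
  //   uint64_t Counters[NumCounters];    // could be swapped for other counter kinds
  //   Context *Callsites[NumCallsites];  // one child-list head per callsite
};
```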