Hi all,
Our team at Meta would like to propose an extension to IRPGO called Temporal Instrumentation. We use a similar technology to significantly improve mobile app cold start by reducing .text
section page faults. We would like to get feedback on this work so we can upstream it to LLVM. Please let us know if you have any questions or have ideas for improvements.
Thanks,
Ellis, Kyungwoo, YongKang, Sergey, Julián, Wenlei
Motivation
PGO has been largely successful at improving performance, especially in the server space. Profiled binaries are run under realistic workloads to collect or sample execution counts. Then hot functions are inlined to reduce call overhead and improve performance. In the mobile space, however, most apps are not CPU-bound, so inlining does not usually improve performance. In fact, inlining can increase code size which actually hurts performance because larger binaries tend to have more page faults, which causes the app to take longer to startup.
On the other hand, changing the order of functions laid out in the binary has a negligible effect on binary size (although it could impact compressed size), but it could greatly improve (startup) performance. For example, by laying out functions executed during startup close together in the binary, we reduce .text
section memory fragmentation and thus reduce page faults during startup. Of course it may not be obvious which functions are involved in startup, so we need to somehow instrument the temporal behavior of these functions.
Recently, our team has worked on two related projects in the area, described in the following papers: Efficient Profile-Guided Size Optimization for Native Mobile Applications and Optimizing Function Layout for Mobile Applications. The first introduces a lightweight instrumentation that can be used to produce a function order that reduces page faults during startup for mobile applications. The second improves upon the algorithm to further reduce page faults. When combined, these works reduce the number of page faults in the first 10 seconds by 20%-35% in our tests.
The goal of this RFC is to propose an extension to IRPGO called Temporal Instrumentation. We expect that our initial function order optimizations will achieve results similar to the work mentioned above with hopefully more optimizations to come. We have a proof of concept on this branch in github in progress and we will start uploading patches if this RFC gets positive feedback.
Design
For a single profiled run, we would like to collect function timestamps and combine them into a function trace. A function timestamp is the timestamp for when the function was executed for the first time and a function trace is a set of functions ordered by their timestamps from a single run. To simplify the explanation, we have an example implementation below. Each function maintains a unique timestamp variable that is initially zero. When that function is executed for the first time, we increment a global timestamp and store it to the local timestamp.
int global_timestamp = 1;
// Run at the entry of each instrumented function.
if (*timestamp == 0)
*timestamp = global_timestamp++;
Since we are interested in using this instrumentation mode for mobile apps, even instrumented binary size is a concern. We want this mode to be compatible with some features we announced in this Lightweight Instrumentation RFC, namely Debug Info Correlation and Minimal Block Coverage. This means we need to store the function’s local timestamp in the __llvm_prf_cnts
data section so that it is compatible with Debug Info Correlation. There actually already exists something similar that Manman Ren upstreamed that is enabled with -enable-order-file-instrumentation
. Unfortunately this is not integrated with the rest of IRPGO or Lightweight Instrumentation. Temporal Profiling is as lightweight as -enable-order-file-instrumentation
, but has the benefit of reusing PGOGen
, the indexed profile format, and the llvm-profdata
tool.
PGOGen
We will need to create a new intrinsic to implement this extension:
llvm.instrprof.timestamp(ptr <name>, i64 <hash>, i32 <num-counters>, i32 <index>)
This is similar to llvm.instrprof.increment
except that it lowers to different code during PGOGen
. To reduce binary size overhead, the timestamp intrinsic is lowered to a call out to the profile runtime which conditionally sets the local timestamp as described above.
Profiles
When timestamp instrumentation is used, the first four bytes of a function’s __llvm_prf_cnts
section is its local timestamp and the function’s counters are found in the rest of the section like normal. This can be implemented with a few changes to the profile reader code and the runtime. To detect whether timestamp instrumentation is used, we need a new VARIANT_MASK_TEMPORAL_INSTR
bit in the version field. Unfortunately the number of bits allocated in VARIANT_MASKS_ALL
is (or soon will be) exhausted so we may need to add an extra four bits.
Indexed Profiles
For each profile, we can combine function timestamps to get a function trace which we can store in a new section in the indexed profile. The size of this new section is O(#functions * #traces)
which can be quite large when merging thousands of profiles for a large binary. Luckily, since we are focused on optimizing page faults during startup, the most important functions are at the beginning of the trace so we can truncate the number of functions in each trace to N
. From local testing, we found that only a small sample of collected traces are needed to effectively optimize function order. If we keep a sample of M
traces in the indexed profile, the size overhead goes down to O(N * M)
which is much more reasonable.
We would like to avoid changing the process of merging profiles when using timestamp instrumentation. Unfortunately, since the llvm-profdata
tool can merge different sets of profiles in different commands, this makes it difficult to collect a random sample of traces. We must use reservoir sampling to correctly maintain a sample of traces in the indexed profile. To understand this, we must introduce a few variables: NumTraces is the number of traces currently stored in the indexed profile, TraceStreamSize is the total number of traces we have seen so far, and TraceReservoirSize is the maximum number of traces we’d like to store. Put simply, as traces are added past TraceReservoirSize, we randomly choose a trace to replace with a probability based on TraceStreamSize. For determinism, we use a fixed value to seed the RNG.
In the end, the new section will look like the following:
FunctionTracesSection {
i64 NumTraces;
i64 TraceStreamSize;
Trace[NumTraces] {
i64 NumFunctions;
i64[NumFunctions] NameRef;
};
};
Computing & Enforcing Symbol Order
In our function layout paper we describe an algorithm that computes a function order that reduces page faults during startup using a sample of traces. This function order can easily be passed to the linker using -symbol-ordering-file
on ELF or -order-file
on Mach-O.