RFC: Temporal Profiling Extension for IRPGO

Hi all,

Our team at Meta would like to propose an extension to IRPGO called Temporal Instrumentation. We use a similar technology to significantly improve mobile app cold start by reducing .text section page faults. We would like to get feedback on this work so we can upstream it to LLVM. Please let us know if you have any questions or have ideas for improvements.

Thanks,

Ellis, Kyungwoo, YongKang, Sergey, Julián, Wenlei

Motivation

PGO has been largely successful at improving performance, especially in the server space. Profiled binaries are run under realistic workloads to collect or sample execution counts. Then hot functions are inlined to reduce call overhead and improve performance. In the mobile space, however, most apps are not CPU-bound, so inlining does not usually improve performance. In fact, inlining can increase code size which actually hurts performance because larger binaries tend to have more page faults, which causes the app to take longer to startup.

On the other hand, changing the order of functions laid out in the binary has a negligible effect on binary size (although it could impact compressed size), but it could greatly improve (startup) performance. For example, by laying out functions executed during startup close together in the binary, we reduce .text section memory fragmentation and thus reduce page faults during startup. Of course it may not be obvious which functions are involved in startup, so we need to somehow instrument the temporal behavior of these functions.

Recently, our team has worked on two related projects in the area, described in the following papers: Efficient Profile-Guided Size Optimization for Native Mobile Applications and Optimizing Function Layout for Mobile Applications. The first introduces a lightweight instrumentation that can be used to produce a function order that reduces page faults during startup for mobile applications. The second improves upon the algorithm to further reduce page faults. When combined, these works reduce the number of page faults in the first 10 seconds by 20%-35% in our tests.

The goal of this RFC is to propose an extension to IRPGO called Temporal Instrumentation. We expect that our initial function order optimizations will achieve results similar to the work mentioned above with hopefully more optimizations to come. We have a proof of concept on this branch in github in progress and we will start uploading patches if this RFC gets positive feedback.

Design

For a single profiled run, we would like to collect function timestamps and combine them into a function trace. A function timestamp is the timestamp for when the function was executed for the first time and a function trace is a set of functions ordered by their timestamps from a single run. To simplify the explanation, we have an example implementation below. Each function maintains a unique timestamp variable that is initially zero. When that function is executed for the first time, we increment a global timestamp and store it to the local timestamp.

int global_timestamp = 1;

// Run at the entry of each instrumented function.
if (*timestamp == 0)
  *timestamp = global_timestamp++;

Since we are interested in using this instrumentation mode for mobile apps, even instrumented binary size is a concern. We want this mode to be compatible with some features we announced in this Lightweight Instrumentation RFC, namely Debug Info Correlation and Minimal Block Coverage. This means we need to store the function’s local timestamp in the __llvm_prf_cnts data section so that it is compatible with Debug Info Correlation. There actually already exists something similar that Manman Ren upstreamed that is enabled with -enable-order-file-instrumentation. Unfortunately this is not integrated with the rest of IRPGO or Lightweight Instrumentation. Temporal Profiling is as lightweight as -enable-order-file-instrumentation, but has the benefit of reusing PGOGen, the indexed profile format, and the llvm-profdata tool.

PGOGen

We will need to create a new intrinsic to implement this extension:

llvm.instrprof.timestamp(ptr <name>, i64 <hash>, i32 <num-counters>, i32 <index>)

This is similar to llvm.instrprof.increment except that it lowers to different code during PGOGen. To reduce binary size overhead, the timestamp intrinsic is lowered to a call out to the profile runtime which conditionally sets the local timestamp as described above.

Profiles

When timestamp instrumentation is used, the first four bytes of a function’s __llvm_prf_cnts section is its local timestamp and the function’s counters are found in the rest of the section like normal. This can be implemented with a few changes to the profile reader code and the runtime. To detect whether timestamp instrumentation is used, we need a new VARIANT_MASK_TEMPORAL_INSTR bit in the version field. Unfortunately the number of bits allocated in VARIANT_MASKS_ALL is (or soon will be) exhausted so we may need to add an extra four bits.

Indexed Profiles

For each profile, we can combine function timestamps to get a function trace which we can store in a new section in the indexed profile. The size of this new section is O(#functions * #traces) which can be quite large when merging thousands of profiles for a large binary. Luckily, since we are focused on optimizing page faults during startup, the most important functions are at the beginning of the trace so we can truncate the number of functions in each trace to N. From local testing, we found that only a small sample of collected traces are needed to effectively optimize function order. If we keep a sample of M traces in the indexed profile, the size overhead goes down to O(N * M) which is much more reasonable.

We would like to avoid changing the process of merging profiles when using timestamp instrumentation. Unfortunately, since the llvm-profdata tool can merge different sets of profiles in different commands, this makes it difficult to collect a random sample of traces. We must use reservoir sampling to correctly maintain a sample of traces in the indexed profile. To understand this, we must introduce a few variables: NumTraces is the number of traces currently stored in the indexed profile, TraceStreamSize is the total number of traces we have seen so far, and TraceReservoirSize is the maximum number of traces we’d like to store. Put simply, as traces are added past TraceReservoirSize, we randomly choose a trace to replace with a probability based on TraceStreamSize. For determinism, we use a fixed value to seed the RNG.

In the end, the new section will look like the following:

FunctionTracesSection {
  i64 NumTraces;
  i64 TraceStreamSize;
  Trace[NumTraces] {
    i64 NumFunctions;
    i64[NumFunctions] NameRef;
  };
};

Computing & Enforcing Symbol Order

In our function layout paper we describe an algorithm that computes a function order that reduces page faults during startup using a sample of traces. This function order can easily be passed to the linker using -symbol-ordering-file on ELF or -order-file on Mach-O.

2 Likes

I’ve just published ⚙ D147287 [InstrProf] Temporal Profiling for review and I’d love to get feedback!

1 Like

Some high level questions:

  1. What is missing in the existing enable-order-file-instrumentation?
  2. For the profile merging in the index profile, why is there a need to keep multiple traces? Should the final ordering decision (a reduction) be made and store only the final layout order? I understand the need to keep the the original data as much as possible, but the sampling process is already not lossless.

The major advantage is the integration into the existing InstrProf flow. It is compatible with Lightweight Instrumentation to provide smaller binary size overhead, it abides by the noprofile and skipprofile attributes, and llvm-profdata is used to merge raw profiles and consume indexed profiles the same way for all instrumentation modes. I believe this is an improvement over learning how to use the existing enable-order-file-instrumentation.

There are also some implementation differences. enable-order-file-instrumentation maintains a large buffer and atomically appends function name hashes as functions are called while this work reuses function-local profile data and writes a timestamp when functions are called. This results in less codesize overhead and probably less runtime overhead.

In Optimizing Function Layout for Mobile Applications we describe an algorithm to compute a function order that both minimizes page faults during startup and reduces compressed binary size. To run, it needs a sample of function traces and the contents of the functions themselves, and I hope to eventually run this in the compiler. The other difficulty is that the llvm-profdata tool supports merging raw profiles in multiple invocations, so the .profdata format would need to store individual traces anyway, even it we did store the final function order in the .profdata file.

GCC’s timing profiling looks similar to this proposal and it is lightweight. Is there a plan to deprecate the order-file instrumentation once this is in? It is desirable to minimize things with overlapping functionalities.

Can you point me to GCC’s “timing profiling”?

I can work with @manman-ren to deprecate order-file instrumentation. This work supports the same functionality so users should be able to switch over to Temporal Instrumentation once this is done.

I would like to mention that I recently landed ⚙ D147812 [InstrProf] Use BalancedPartitioning to order temporal profiling trace data which implements the balanced partitioning algorithm described in Optimizing Function Layout for Mobile Applications. With this work, we can use a profile with temporal traces to compute a function order using the llvm-profdata tool.

llvm-profdata order default.profdata -o order.txt

Then order.txt can be passed to the linker to layout functions using -symbol-ordering-file on ELF or -order-file on Mach-O.

Edit: Updated link to paper.

4 Likes

I would also like to mention that I recently gave a talk at EuroLLVM 2024 on this topic:

(For some reason this was also uploaded at https://www.youtube.com/watch?v=RFEhtAqedcQ and I’ve posted about that in Duplicate EuroLLVM YouTube Uploads)

And I have at PR to add flags to order sections in the linker to improve startup and compressed size: