[RFC] Order File Instrumentation

An order file teaches ld64 how to order the functions in a binary. If we place all functions executed during startup together, in the right order, we greatly reduce page faults during startup.

To generate an order file for iOS apps, we usually use dtrace, but some apps have multiple startup scenarios that we want to capture in the order file. The dtrace approach is not easy to automate, and without automation it is hard to capture the different ways of starting an app. Instrumented builds, however, can be deployed to phones, and profile data can be collected automatically.

For the Facebook app, based on the startup distribution, we expect a big win from order file instrumentation, on the order of 100ms to 500ms+ of startup time.

The basic idea of the pass is to use a circular buffer to log the execution order of the functions. We log a function only the first time it is executed. Instead of logging the function's symbol name, we log a pair of integers: one specifying the module id, and the other specifying the function id within the module.

In this pass, we add three global variables:
(1) an order file buffer
The order file buffer is a circular buffer in its own llvm section. Each entry is a pair of integers: one specifying the module id, and the other specifying the function id within the module.
(2) a bitmap for each module: one bit per function, indicating whether the function has already been executed;
(3) a global index to the buffer

At the function prologue, if the function has not yet been executed (checked via the bitmap), we log the module id and the function id, then atomically increment the index.
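The prologue logic above can be sketched in C as follows. The buffer size, function count, and helper name are illustrative (the real pass inlines this as IR and hard-codes INSTR_ORDER_FILE_BUFFER_SIZE), and the bitmap check-and-set here is not atomic, so this is only a simplified sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

#define BUFFER_SIZE 1024  /* stand-in for INSTR_ORDER_FILE_BUFFER_SIZE */
#define NUM_FUNCS   64    /* functions in this module (illustrative) */

static uint32_t OrderBuffer[BUFFER_SIZE];     /* (1) circular order file buffer */
static uint8_t  Bitmap[(NUM_FUNCS + 7) / 8];  /* (2) one bit per function */
static atomic_uint BufferIndex;               /* (3) global index into the buffer */

/* Logic inserted at each function prologue: log (module id, function id)
   the first time the function executes, then bump the index atomically. */
void order_file_prologue(uint32_t ModuleId, uint32_t FuncId) {
  uint8_t Mask = (uint8_t)(1u << (FuncId % 8));
  if (Bitmap[FuncId / 8] & Mask)
    return;                                   /* already logged */
  Bitmap[FuncId / 8] |= Mask;
  /* atomically claim a slot, wrapping around (circular buffer) */
  unsigned Slot = atomic_fetch_add(&BufferIndex, 1) % BUFFER_SIZE;
  OrderBuffer[Slot] = (ModuleId << 16) | (FuncId & 0xffffu);  /* packed pair */
}
```

A second execution of the same function takes only the bitmap check and returns, which is what keeps the steady-state overhead low.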

This pass is intended to be used as a ThinLTO or LTO pass. It maps each module to a distinct integer, and it also generates a mapping file so we can decode the function symbol name from the pair of ids.

clang has ‘-finstrument-function-entry-bare’, which inserts a function call at each entry and is not as efficient.

Three patches are attached, for llvm, clang, and compiler-rt respectively.

TODO:
(1) Migrate to the new pass manager with a shim for the legacy pass manager.
(2) For the order file buffer, consider always emitting definitions, making them LinkOnceODR with a COMDAT group.
(3) Add test cases for the clang/compiler-rt patches.
(4) Add utilities to deobfuscate the profile dump.
(5) Make the buffer size configurable; it is currently hard-coded (INSTR_ORDER_FILE_BUFFER_SIZE).

Thanks Kamal for contributing to the patches! Thanks to Aditya and Saleem for doing an initial review pass over the patches!

Manman

orderfile-llvm.patch (16 KB)

orderfile-rt.patch (6.98 KB)

orderfile-clang.patch (1.46 KB)

Hi Manman,

Ordering profiling is certainly something very useful for startup time performance. GCC has something similar.

In terms of implementation, it is possible to simply extend the edge profiling counters by one per function, and instrument each function to record a time stamp the first time it is executed. The overhead will be minimal, and you can leverage all the existing support in the profiling runtime.

Another possibility is to use xray to implement the functionality – xray is useful for trace like profiling by design.

David


Hi David,

Just to clarify, are you suggesting to add an edge profiling counter per function to record the time stamp? Where are the edge profiling counters defined?

So the difference will be where we store the profile information and in what format?

With the suggested approach, we need to allocate one time stamp per function, whereas what is implemented is a pair of numbers per executed function. The runtime performance can differ as well: the suggested approach gets a time stamp and saves it to memory, whereas the implemented approach saves the pair of numbers and increments a counter.

Another possibility is to use xray to implement the functionality – xray is useful for trace like profiling by design.

We have not looked into XRay. We need something with a low binary size penalty and low runtime perf degradation; not sure if XRay is a good fit!

Thanks,
Manman


Just to clarify, are you suggesting to add an edge profiling counter per function to record the time stamp? Where are the edge profiling counters defined?

There is no need to define the counter explicitly. What is needed is to introduce a new intrinsic to update the order counter, and the InstrProf lowerer will create the counter for you. See Transforms/Instrumentation/InstrProfiling.cpp. +Vedant Kumar

So the difference will be where we store the profile information and in what format?

The time stamp can simply be a uint64 value, just like any edge/block profile count. The llvm-profdata tool can be taught to dump the ordering file to be exported to the linker.

With the suggested approach, we need to allocate one time stamp for each function, what is implemented is a pair of numbers for each executed function. The runtime performance can be different as well, the suggested approach gets the time stamp, and saves it to memory, what is implemented is saving the pair of numbers and incrementing a counter.

The runtime cost is probably not much for either approach. The suggested approach eliminates the trouble of maintaining/implementing a different way to identify functions.

Another possibility is to use xray to implement the functionality – xray is useful for trace like profiling by design.

We have not looked into XRay. We need something with low binary size penalty and low runtime perf degradation, not sure if XRay is a good fit!

xray can achieve very low runtime overhead. +Dean Michael Berris for additional comments.

David


There is no need to define the counter explicitly. What is needed is to introduce a new intrinsic to update the order counter, and the InstrProf lowerer will create the counter for you. See Transforms/Instrumentation/InstrProfiling.cpp. +Vedant Kumar

We need the instrumentation to run late as an IR pass, so it has all the function symbols. Are you suggesting front-end instrumentation by mentioning InstrProfiling.cpp?


The runtime cost is probably not much for either approach. The suggested approach eliminates the trouble to maintain/implement a different way to identify functions.

Agree that using existing infra is better!

Manman


We need the instrumentation to be run late as an IR pass, so it has all the function symbols. Are you suggesting a front-end instrumentation by mentioning InstrProfiling.cpp?

The lowering is actually shared by both FE instrumentation and IR PGO. For ordering profiling, the pass in theory needs to run after the inliner (which is even later than the current IR PGO instrumentation pass). One way to achieve the same effect is to lower the time stamp counter update intrinsic in a separate pass (after inlining). In the lowerer, the counter update intrinsic can be eliminated if it is in an inline instance, but this is not ideal either, as the inlining decisions may be affected by the instrumentation.

To combine the ordering profiling requirements (post inlining) with the benefit of sharing (runtime, tooling, etc.), the proposed implementation can be modified with the following changes:
0) keep the phase ordering in the patch
1) the instrumentation just inserts intrinsics
2) enhance the lowering to handle the new intrinsics

I assume ordering profiling is not intended to work together with PGO instrumentation (to make sure inlining decisions are the same), so you end up with a counter section with one counter per out-of-line function. The rest will just work. The changes to the runtime will be minimal.

If PGO (IR or FE) + ordering profiling is a desired use case, it can be made to work too – the lowerer needs to update the per-function counter array size (appending one).

David


I assume ordering profiling is not intended to work together with PGO instrumentation (to make sure inlining decisions are the same), so the you end up with a counter section with one counter per out of line function. The rest will just work. The changes to runtime will be minimal.

Got it, this should work! The overhead will be one counter for each function, and we also need to save the function names in the instrumented binary, right?
Other than binary size, the other concern is the size of the profile data, since we need to download it from the device. The profile data will be one counter per function plus all the function names?

In the proposed implementation, the profile data will be the size of the circular buffer, which should be the maximum number of functions that can be executed during startup.

If PGO (IR or FE) + Ordering profiling is a desired use case, it can be made working too – the lowerer needs to update the per function counter array size (appending one).

There may be a use case for IR instrumentation + ordering profiling. One important usage of IR instrumentation is to guide hot/cold splitting for startup time.
It is always simpler if we only need to maintain one instrumented build! But that depends on how much overhead IR instrumentation adds in terms of binary size, profile data size, and runtime perf.

Manman


Got it, this should work! The overhead will be one counter for each function, and we also need to save the function names in the instrumented binary, right?

There is no need to save function names – that is handled by the existing lowerer and runtime. That is why the size overhead is larger than one counter per function. Static function names are also qualified with the enclosing module name.

Other than binary size, the other concern is the size of the profile data, as we need to download it from device. The profile data will be one counter for each function plus all the function names?

There is per-function metadata recorded in the llvm_prf_data section. Function names are compressed and stored in the names section. We don’t have it now, but it is possible to add an option to not load the name section into memory at all; the llvm-profdata tool would then need to be taught to load the name section from the executable (as the profile data file won’t have it).

If PGO can be used on the device, then the size overhead should not be a concern here – otherwise we have a usability issue with PGO.

It is also reasonable to assume that instrumented builds are not intended to run on low-end devices with small memory (unlike release builds).

In the proposed implementation, the profile data will be the size of the circular buffer, which should be the maximal number of functions that can be executed during startup.

If PGO (IR or FE) + Ordering profiling is a desired use case, it can be made working too – the lowerer needs to update the per function counter array size (appending one).

There may be a use case for IR Instr + ordering profiling. One important usage of IR Instr is to guide hot/cold splitting for startup time.
It is always simpler if we only need to maintain one instrumentation build! But that depends on how much overhead IR Instr adds in terms of binary size, profile data size, and runtime perf.

This use case makes it more attractive to unify the new feature with the existing framework.

David


clang has '-finstrument-function-entry-bare' which inserts a function call and is not as efficient.

Can you elaborate on why this existing functionality is not efficient
enough for you?

For Chrome on Windows, we use -finstrument-functions-after-inlining to
insert calls at function entry (after inlining) that call a function
which captures the addresses in a buffer, and later symbolizes and
dumps them to an order file that we feed the linker. We use a similar
approach for Chrome on Android, but I'm not as familiar with the
details there.

Thanks,
Hans

I would love to see this kind of order profiling support. Using dtrace to generate function orders is actually really problematic, because dtrace makes implementation tradeoffs that allow it to ignore probe execution if the performance impact on the system is too great. This can make dtrace non-deterministic, which is not ideal for generating optimization data.

Additionally, if order generation could be enabled at the same time as PGO generation, that would be a great solution for generating profile data for optimizing clang itself. Clang has some scripts and build-system goop under utils/perf-training that generate order files using dtrace and PGO data; it would be great to apply this technique to those tools too.

-Chris

Some background information first, then a quick summary of what we have discussed so far!

Background: the Facebook app is one of the biggest iOS apps. Because of this, we want the instrumentation to be as lightweight as possible in terms of binary size, profile data size, and runtime performance. The plan to improve Facebook app startup time is to (1) implement order file instrumentation as lightly as possible, (2) push the order file instrumentation to internal users first, and then to external beta users if the overhead is low, (3) enable PGO instrumentation to collect information to guide hot/cold splitting, and (4) push PGO instrumentation to internal users.

There are a few alternatives we have discussed:
(A) What is proposed in the initial email: Log (module id, function id) into a circular buffer in its own profile section when a function is first executed.

(B) Reuse the existing infrastructure: a per-function counter records the timestamp when a function is first executed.
Compared to option (A), the runtime overhead of option (B) should be higher, since we will be fetching a timestamp for each function executed at startup, and the binary and the profile data will be larger, since it needs one number per function plus the additional overhead of the per-function metadata recorded in llvm_prf_data. The buffer size for option (A) is controllable; it only needs to hold the number of functions executed at startup.

For the Facebook app, we expect that the number of functions executed during startup is 1/3 to 1/2 of all functions in the binary. Profile data size is important, since we need to upload the profile data from the device to the server.

The plus side is reusing the existing infra!

In terms of integration with PGO instrumentation, both (A) and (B) should work. For (B), we need to increase the number of per function counters by one. For (A), they will be in different sections.

(C) XRay
We have not looked into this, but would like to hear more about it!

(D) -finstrument-functions-after-inlining or -finstrument-function-entry-bare
We are worried about the runtime overhead of calling a separate function when starting up the app.

Thanks,
Manman


(B) Re-use existing infra of a per function counter to record the timestamp when a function is first executed
Compared to option (A), the runtime overhead for option (B) should be higher since we will be calling timestamp for each function that is executed at startup time,

The ‘timestamp’ can be just a global index. Since there is one counter per function, the counter can be initialized to ‘-1’ so that you don’t need the bitmap to track whether the function has been invoked. In other words, the runtime overhead of (B) could be lower :)
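A sketch of this variant in C, with one uint64 counter per function pre-set to the -1 sentinel (names, sizes, and the init/update helpers are all illustrative, not the actual InstrProf lowering):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_FUNCS 64  /* illustrative */

/* One counter per function; UINT64_MAX (-1) means "not yet executed",
   so no separate bitmap is needed. */
static uint64_t OrderCounter[NUM_FUNCS];
static atomic_ullong GlobalIndex;  /* logical timestamp */

void order_counters_init(void) {
  for (unsigned I = 0; I < NUM_FUNCS; ++I)
    OrderCounter[I] = UINT64_MAX;  /* the -1 sentinel */
}

/* Inserted at each function entry: record the global index on first
   execution only; the stored values give the startup order directly. */
void order_counter_update(uint32_t FuncId) {
  if (OrderCounter[FuncId] != UINT64_MAX)
    return;                                       /* already recorded */
  OrderCounter[FuncId] = atomic_fetch_add(&GlobalIndex, 1);
}
```

Sorting functions by their recorded counter value then reproduces the first-execution order, which is all the order file needs.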

David


The ‘timestamp’ can be the just an global index. Since there is one counter per func, the counter can be initialized to be ‘-1’ so that you don’t need to use bitmap to track if the function has been invoked or not. In other words, the runtime overhead of B) could be lower :slight_smile:

That actually works! We only care about the ordering of the functions. But the concerns about profile data size and binary size still exist :]


and the binary and the profile data will be larger since it needs one number for each function plus additional overhead in the per-function metadata recorded in llvm_prf_data. The buffer size for option (A) is controllable, it needs to be the number of functions executed at startup.

Do you have a rough estimate of how much overhead the per-function metadata adds?

Manman


That actually works! We only care about the ordering of the functions. But the concern on profile data size and binary size still exist :]

The runtime should be similar, as we still need to check whether the counter is “-1” before saving the global index. We don’t need the separate bitmap though. Also, the counter can be initialized to 0 and the global index can start from 1.


Do you have a rough estimation on how much overhead the per-function metadata is?

For PGO, it is 8 double words per function, but 7 of them are unnecessary. It is entirely reasonable to emit only one double word (a reference to the name) in the per-function data when only order profiling is turned on (encoding this in the profile header version field). We can delay support for mixed mode (with PGO instrumentation) until later.

David


The runtime should be similar as we still need to check if the counter is “-1” before saving the global index. We don’t need the separate bitmap though. Also the counter can be initialized to 0 and the global index can start from 1.

If we don’t need the bitmap, then the two approaches are converging!

David

I chatted with David offline during the weekend. Thanks for the great discussions, David!

The trimmed-down version of the current infra will require 2 x 8 bytes per function, while the circular buffer implementation requires 4 bytes (2 bytes for the module id, 2 bytes for the function id) per startup function. For the Facebook app, that means the profile data will be about 8 times larger. Since we want to push the instrumented build to external test users, we are trying to minimize uploading from devices to servers.
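The 8x figure follows from the numbers above, taking the startup set as half of all functions (the thread's earlier 1/3 to 1/2 estimate); the two helpers below just restate that arithmetic:

```c
/* Rough profile-size comparison; the "half of all functions execute at
   startup" fraction is an assumption for illustration. */
unsigned long counter_infra_bytes(unsigned long NumFuncs) {
  return NumFuncs * 2 * 8;     /* 2 x 8 bytes per function */
}
unsigned long circular_buffer_bytes(unsigned long NumStartupFuncs) {
  return NumStartupFuncs * 4;  /* 2-byte module id + 2-byte function id */
}
```

For example, 100,000 functions with 50,000 executed at startup gives 1,600,000 bytes versus 200,000 bytes, an 8x difference.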

The circular buffer implementation currently uses (module id, function id), which only works in ThinLTO mode. David suggested decoupling from ThinLTO by using the 8-byte MD5 of the function names.

I plan to revise the existing patches to decouple from ThinLTO, following David’s suggestion. Let me know if you have questions about the general approach!

Thanks,
Manman

The plan (decouple from ThinLTO and use a circular buffer + bitmap to save memory) sounds good to me.

David