Hi there. I’m Zhuoli.
Background
Recently I saw @ellishg 's approach ([lld][InstrProf] Profile guided function order by ellishg · Pull Request #96268 · llvm/llvm-project · GitHub) to lld-level function reordering using Temporal Profiling, and I've watched the talk at https://www.youtube.com/watch?v=yd4pbSTjwuA&ab_channel=LLVM
Over the last year, we (a small team at ByteDance focused on mobile performance optimization) have done something similar for Mach-O on arm64 via the ld64 linker.
So after seeing this, I'd like to share some ideas about Ellis's approach, and I think it's worth having the discussion in the open so that anyone interested in this can benefit.
Question 1: Is there a better way to detect hot/cold functions via Temporal Profiling?
Currently, there are two kinds of function reordering: one for start-up speed and one for compressed size. Following the original paper (Reordering Functions in Mobile Apps for Reduced Size and Faster Start-Up | ACM Transactions on Embedded Computing Systems), we use the profile to split functions into two sets. For hot functions, we run the balanced partitioning algorithm for start-up (using page faults as the utility set). For cold functions, we run the algorithm for compression (using sequence hashes as the utility set).
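To make the compression side concrete, here is a minimal sketch of how a "sequence hash" utility set could be built. The window size, the FNV-1a hash, and the function name are my own illustration, not the paper's or LLVM's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hash every K-byte window of a function's encoded bytes (FNV-1a).
// Two functions that share many window hashes tend to compress better
// when laid out next to each other, so these hashes serve as the
// "utilities" that balanced partitioning tries to co-locate.
std::vector<uint64_t> compressionUtilities(const std::vector<uint8_t> &Bytes,
                                           size_t K = 8) {
  std::vector<uint64_t> Utils;
  for (size_t I = 0; I + K <= Bytes.size(); ++I) {
    uint64_t H = 0xcbf29ce484222325ULL; // FNV-1a offset basis
    for (size_t J = 0; J < K; ++J) {
      H ^= Bytes[I + J];
      H *= 0x100000001b3ULL; // FNV-1a prime
    }
    Utils.push_back(H);
  }
  return Utils;
}
```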
The profile trace seems to contain only two elements per function: a timestamp (the index of the function's first call) and a count (the total number of calls during the app's lifetime).
Currently, in the talk and in the implementation (InstrProfReader.cpp), it seems the simplest heuristic is used to detect hot/cold functions: any function with `timestamp > 0` in the trace is treated as hot, and all others as cold. But I suspect there may be a better way.
For example, if a function is called only once during the whole application lifecycle (such as a shared-instance getter, or something guarded by `std::call_once`), maybe there is still a chance to mark it as cold and reorder it to gain better compressed binary size instead of launch performance?
Using something like a cost model or a tunable threshold to adjust the hot/cold splitting algorithm may be better.
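As a strawman, here is a tiny sketch of what such a threshold might look like. The struct layout and names are hypothetical, not the actual profile reader's types:

```cpp
#include <cstdint>

// Hypothetical view of one temporal-profile trace entry.
struct TraceEntry {
  uint64_t Timestamp; // index of the function's first call in the trace
  uint64_t Count;     // total call count over the app's lifetime
};

// Today's heuristic is effectively `Timestamp > 0 => hot`. Adding a
// minimum call-count threshold would let call-once functions (shared
// instances, std::call_once bodies) fall into the cold set, where they
// can be reordered for compressed size instead of start-up.
bool isHot(const TraceEntry &E, uint64_t MinCount = 2) {
  return E.Timestamp > 0 && E.Count >= MinCount;
}
```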
Question 2: For large binaries (arm64), how can we reduce the uncompressed binary size regression when too many branch islands occur?
At ByteDance, our top iOS apps have binaries larger than 128 MB (for example, the Message App's uncompressed binary size is close to 400 MB).
But when we tried the same approach on such a large binary, every branch whose target falls outside the direct-branch range forces the linker to create a branch island, so more islands are generated (ld64/doc/design/linker.html at main · apple-oss-distributions/ld64 · GitHub).
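For context, arm64 `b`/`bl` instructions encode a signed 26-bit word offset, so a direct branch can only reach ±128 MiB from the call site. A simplified sketch of the reachability check a linker has to perform (not ld64's actual code):

```cpp
#include <cstdint>

// arm64 B/BL encode a signed 26-bit immediate in units of 4 bytes,
// so a direct branch can only reach +/-128 MiB from the call site.
// Anything farther needs a branch island (ld64) or a thunk (lld).
constexpr int64_t kBranchRange = 128ll * 1024 * 1024;

bool needsIsland(uint64_t CallerAddr, uint64_t CalleeAddr) {
  int64_t Delta = static_cast<int64_t>(CalleeAddr - CallerAddr);
  return Delta < -kBranchRange || Delta >= kBranchRange;
}
```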
In our case, we have already built the BPC (balanced-partitioning-for-compression) algorithm into the ld64 linker and measured the results: if we arbitrarily reorder all cold functions (nearly 80% of the functions in the full binary), the uncompressed binary size regresses because of the increased island count.
- Test result using firefox-ios, 100% of cold functions reordered, ld64-711: it shows both compressed size and uncompressed binary size benefits.
- Test result using the Message App, 100% of cold functions reordered, ld64-907: it shows compressed size benefits but an uncompressed binary size regression.
To address this, I'm considering a few ways to reduce the binary size regression:
- The simplest way: only apply the BPC reorder to a small part of the code, selected via Temporal Profiling. For example, since we have both compiler and linker context, we can restrict function reordering to size-sensitive code, such as compile units that use `-Oz` or, in the Swift compiler, `-Osize`.
  - Test result using the Message App, reordering only `-Oz`/`-Osize` functions
- Enhance the BPC algorithm's cost function: moving a function beyond the 128 MB direct-branch range (arm64) would incur an additional cost, which may reduce long-distance reordering (see the sketch after this list).
- Optimize ld64's branch island generation to avoid branch islands where possible. I'm not an expert in linker design and implementation, but I've found that the lld linker uses thunks instead of branch islands for this, here: ConcatOutputSection.cpp. From the implementation, I think the thunk-based solution will cause less binary size regression when a large number of functions are moved around, but this still needs further testing.
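To illustrate the cost-function idea from the second bullet above, here is a hypothetical sketch of a distance-aware gain adjustment for a candidate BPC move. The function name, the island cost constant, and the shape of the penalty are all assumptions on my part, not a tested design:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical distance-aware gain for a candidate BPC move: keep the
// usual co-compression gain, but charge a penalty for every call site
// that the move would push beyond the arm64 direct-branch range, since
// each such call may force the linker to emit a branch island.
double adjustedGain(double CompressionGain, int64_t CallerToCalleeDist,
                    uint64_t NumCallSites, double IslandCostBytes = 16.0) {
  constexpr int64_t kBranchRange = 128ll * 1024 * 1024; // +/-128 MiB (arm64)
  double Penalty = 0.0;
  if (std::llabs(CallerToCalleeDist) >= kBranchRange)
    Penalty = IslandCostBytes * static_cast<double>(NumCallSites);
  return CompressionGain - Penalty;
}
```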
Conclusion
The idea of function reordering via Temporal Profiling is great, but for large apps there are still some challenges.
I hope these questions and this discussion are useful for anyone interested in PGO for mobile apps.