Some questions about profile-guided function ordering via Temporal Profiling (e.g., binary size regression)

Hi there. I’m Zhuoli.

Background

Recently I saw @ellishg's approach ([lld][InstrProf] Profile guided function order by ellishg · Pull Request #96268 · llvm/llvm-project · GitHub) to lld-level function reordering using Temporal Profiling, and I've watched the talk at https://www.youtube.com/watch?v=yd4pbSTjwuA&ab_channel=LLVM

In the last year, we (a small team at ByteDance focused on mobile performance optimization) have already done something similar for Mach-O on the arm64 platform via the ld64 linker.

So after hearing this, I think I can share some ideas about Ellis's approach, and it seems worth discussing this openly so that anyone interested can benefit.

Question 1: A better way to detect hot/cold functions via Temporal Profiling

Currently, function reordering has two goals: start-up speed and compressed size. In the original paper (Reordering Functions in Mobile Apps for Reduced Size and Faster Start-Up | ACM Transactions on Embedded Computing Systems), the profile is used to split functions into two kinds. For hot functions, we run the balanced partitioning algorithm for startup (using the number of page faults as the utility set). For cold functions, we run the algorithm for compression (using sequence hashes as the utility set).

The profile trace seems to contain only two elements per function: a timestamp (the first-call index) and a count (the total call count during the app's run).

Currently, in the talk and in the implementation (InstrProfReader.cpp), it seems the simplest heuristic is used to detect hot/cold functions: any traced function with timestamp > 0 is treated as hot, and all others are cold. But I suspect there may be a better way.

For example, if a function is called only once during the whole application lifecycle (such as a shared-instance accessor, or something inside a static std::once initializer), maybe there is still a chance to mark it as cold and reorder it accordingly, gaining compressed binary size instead of launch performance?

Using something like a cost function or a threshold to tune the hot/cold split might be better.
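For illustration, here is a minimal sketch of what I mean. The names and thresholds are invented for this post and are not the actual InstrProfReader.cpp logic:

```cpp
// Hypothetical threshold-based hot/cold split; names and knobs are invented.
#include <cstdint>

struct TraceEntry {
  uint64_t Timestamp; // first-call index in the trace (0 = never called)
  uint64_t Count;     // total call count over the app's lifetime
};

enum class Temperature { Hot, Cold };

// timestamp > 0 is the current upstream heuristic. The two extra knobs let
// late-called or one-shot functions (e.g. std::call_once initializers) fall
// back into the cold set, which is then ordered for compression instead.
Temperature classify(const TraceEntry &E, uint64_t MaxStartupTimestamp,
                     uint64_t MinHotCount) {
  if (E.Timestamp == 0 || E.Timestamp > MaxStartupTimestamp)
    return Temperature::Cold; // never called, or called too late for startup
  if (E.Count < MinHotCount)
    return Temperature::Cold; // one-shot functions: trade page faults for size
  return Temperature::Hot;
}
```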

Question 2: For large binaries (arm64), how can we reduce the (uncompressed) binary size regression when too many branch islands occur?

At ByteDance, our top iOS apps have more than 128 MB of binary size (for example, our Message App's uncompressed binary is nearly 400 MB).

But when trying the same approach on a large binary, branches that must span a long distance force the linker to create more branch islands. (ld64/doc/design/linker.html at main · apple-oss-distributions/ld64 · GitHub)

In our case, we have already built the bpc algorithm into the ld64 linker and measured the result: if we arbitrarily reorder all cold functions (nearly 80% of the functions in the full binary), the uncompressed binary size regresses because of the increased island count.

  • Test result using firefox-ios, 100% cold functions, ld64-711

It shows both compressed size and binary size benefits.

  • Test result using Message App, 100% cold functions, ld64-907

It shows compressed size benefits but an uncompressed binary size regression.

Given this, I'm considering a few ways to reduce the binary size regression:

  1. The simplest way: only run the bpc reorder on a small part of the code, selected e.g. via Temporal Profiling. Since we have both compiler and linker context, we can reorder only size-sensitive code, such as compilation units built with -Oz or the Swift compiler's -Osize.
  • Test result using Message App, only reordering -Oz/-Osize functions

  2. Enhance the bpc algorithm's cost function: moving a function more than 128 MB away (the arm64 branch range) incurs an additional cost, which should discourage long-distance reordering (see the sketch below).

  3. Optimize ld64's branch island generation to avoid islands where possible. I'm not an expert in linker design and implementation, but I've found that the lld linker uses thunks instead of branch islands for this, here: ConcatOutputSection.cpp

From the implementation, I think the thunk-based solution will have less binary size regression when a large number of functions get moved around, but this still needs further testing.
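To make idea 2 concrete, here is a rough sketch of an extra cost term that counts how many call sites would exceed the arm64 ±128 MB branch range for a candidate layout. The names are invented; this is not our actual ld64 patch:

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Call { uint32_t Caller, Callee; }; // indices into FuncOffset

// Counts calls whose displacement exceeds the arm64 b/bl range under a
// candidate layout; each such call would need a branch island or thunk.
uint64_t estimateIslands(const std::vector<uint64_t> &FuncOffset, // byte offset of each function
                         const std::vector<Call> &Calls) {
  constexpr int64_t BranchRange = 128LL * 1024 * 1024; // +/-128 MB reach
  uint64_t Islands = 0;
  for (const Call &C : Calls) {
    int64_t Dist =
        int64_t(FuncOffset[C.Callee]) - int64_t(FuncOffset[C.Caller]);
    if (std::llabs(Dist) >= BranchRange)
      ++Islands;
  }
  return Islands;
}
```

An ordering pass could evaluate this alongside the compression objective and penalize moves that create new out-of-range calls.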

Conclusion

The idea of function reordering via Temporal Profiling is great, but for large apps there are still some challenges.

Hope this question and discussion are useful for anyone interested in PGO on mobile apps. :pray:

There are a few ways to partition functions into hot and cold sets. Since startup performance is critical in the mobile space, we treat any function with timestamp > 0 as hot and order it to improve startup performance. As you said, this includes functions that are only called during startup. While traditionally these functions might not be considered hot, they contribute to startup performance, so we want to optimize them for that. In retrospect, hot/cold might not be the right terminology; something like startup/steady-state might be better.

We could consider other ways to partition functions: timestamp > 10K could be considered cold (steady-state), since it likely does not contribute to startup performance. I've also imagined a way to feed both page faults and sequence hashes into the Balanced Partitioning algorithm to also optimize the startup set for compression. We would have to take special care not to regress startup performance, or at least evaluate the trade-off. I have not experimented with this yet.
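To sketch that second idea (the names are hypothetical; this is not the upstream LLVM API): each function's utility set becomes the union of its startup utilities and its compression utilities, with the two ID spaces offset so they cannot collide.

```cpp
#include <cstdint>
#include <vector>

using UtilityId = uint32_t;

// Union of startup utilities (from temporal traces / page faults) and
// compression utilities (from sequence hashes), with the hash IDs shifted
// into their own range so the two kinds never collide.
std::vector<UtilityId>
buildUtilities(const std::vector<UtilityId> &TraceUtilities,
               const std::vector<UtilityId> &HashUtilities,
               UtilityId HashSpaceBase) {
  std::vector<UtilityId> UNs(TraceUtilities);
  UNs.reserve(UNs.size() + HashUtilities.size());
  for (UtilityId U : HashUtilities)
    UNs.push_back(HashSpaceBase + U);
  return UNs;
}
```

How to weight the two kinds against each other is the open question; a naive union may dilute the startup signal.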

The LLVM InstrProf framework has its own way to determine hot/cold functions. I believe it considers any function with counts above the 99th(?) percentile to be hot. As I described earlier, this would not include functions critical to startup that are only called once.

Do you know which sections have size regressions in your Message App with 100% cold functions? I'm interested to know whether it's only the __text section, or if some other section regressed.

I'm not an expert on branch islands, but I haven't seen the regression you mentioned. It's not immediately obvious how you would update the algorithm to prevent the long jumps. You may be able to add a utility node for each jump, where a function has that utility node if it is the caller or the callee of that jump.
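A rough sketch of that suggestion, with invented names: each known long call edge gets its own utility node shared by both endpoints, so the balanced partitioning cost rewards placing them together.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using FuncId = uint32_t;
using UtilityId = uint32_t;

// One fresh utility node per long call edge, attached to both endpoints, so
// the partitioner pays a cost whenever it separates a caller from its callee.
std::map<FuncId, std::vector<UtilityId>>
buildJumpUtilities(const std::vector<std::pair<FuncId, FuncId>> &LongCalls,
                   UtilityId FirstJumpUtility) {
  std::map<FuncId, std::vector<UtilityId>> UNs;
  UtilityId Next = FirstJumpUtility;
  for (const auto &[Caller, Callee] : LongCalls) {
    UNs[Caller].push_back(Next);
    UNs[Callee].push_back(Next);
    ++Next;
  }
  return UNs;
}
```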

Thanks for the questions! It’s exciting to see others using this work!

It makes sense if you care about start-up performance.

But actually, we have historically used the traditional way to reorder functions for start-up performance. We use linker-based instrumentation to record all the functions in the final linked binary, and the app provides a marker for the end of start-up; on iOS it's the applicationDidFinishLaunching: Objective-C method.

So we collect the order of all functions called between the first function and the end-of-start-up marker, and use that order to reorder functions. I think the old way is mostly enough for simple cases. (I actually don't get why thousands of traces and the balanced partitioning algorithm are needed to compute the order — do Meta's apps have so many conditionals and branches that they cannot be collapsed into a single stable start-up function trace?)
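For context, a minimal sketch of that traditional scheme. The names are hand-rolled for this post; the real version lives in our linker-based instrumentation:

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <vector>

namespace {
std::mutex OrderMutex;
std::vector<const char *> StartupOrder;
std::atomic<bool> StartupDone{false};
} // namespace

// Called by per-function instrumentation on first entry.
extern "C" void recordFirstCall(const char *Symbol) {
  if (StartupDone.load(std::memory_order_relaxed))
    return; // past the startup marker: ignore steady-state calls
  std::lock_guard<std::mutex> Lock(OrderMutex);
  StartupOrder.push_back(Symbol);
}

// Called from the startup marker, e.g. applicationDidFinishLaunching:.
extern "C" void finishStartupOrder(const char *Path) {
  StartupDone.store(true, std::memory_order_relaxed);
  std::lock_guard<std::mutex> Lock(OrderMutex);
  if (FILE *F = std::fopen(Path, "w")) {
    for (const char *Symbol : StartupOrder)
      std::fprintf(F, "%s\n", Symbol); // one symbol per line
    std::fclose(F);
  }
}
```

The resulting file can then be fed back to ld64 as an order file (-order_file).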

To restate, my first question could be:

Except for the functions between the first function and the end-of-start-up marker: for the other functions with timestamp > 0, can we have a better hot/cold splitting algorithm? Since those functions are not so critical to app performance, we would like to trade some page-fault performance for compressed size, or even binary size.

It sounds like you want functions called before the “startup marker” (applicationDidFinishLaunching: in your case) to be ordered for page faults and all other functions to be ordered for compressed size. You could disable timestamp instrumentation at that marker.

Recently, I updated our ld64 linker implementation with the upstream BalancedPartitioning algorithm and re-ran the test.

Here is the result:

Some explanation:

  1. Reordered functions: 2.2M cold functions (those whose total call count is 0 in my local profile)
  2. llvm-bpc: our ld64 with the upstream LLVM BalancedPartitioning code
  3. ld-bpc: our own bpc algorithm
  4. symbol count / branch island count: measured from the ld64 linkmap file
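For reference, a rough sketch of how such a count can be pulled from the linkmap. It assumes ld64 names island symbols with an ".island" suffix — worth verifying against your ld64 version:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Quick-and-dirty counter over an ld64 linkmap.
int main(int argc, char **argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <linkmap>\n";
    return 1;
  }
  std::ifstream In(argv[1]);
  std::string Line;
  bool InSymbols = false;
  unsigned Symbols = 0, Islands = 0;
  while (std::getline(In, Line)) {
    if (Line.rfind("# Symbols:", 0) == 0) { // start of the symbol table
      InSymbols = true;
      continue;
    }
    if (!InSymbols)
      continue;
    if (!Line.empty() && Line[0] == '#') {
      if (Line.rfind("# Address", 0) == 0)
        continue; // column header directly under "# Symbols:"
      break;      // next section of the linkmap
    }
    ++Symbols;
    if (Line.find(".island") != std::string::npos)
      ++Islands; // assumed island naming convention
  }
  std::cout << "symbols: " << Symbols << "\nbranch islands: " << Islands << "\n";
}
```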

From the result, the binary size increase comes almost entirely from the increased branch islands.

Tip: I even used a diff tool to compare the base linkmap with the reordered one; the differing symbols are all branch island functions.

Section compare:

Following your suggestion, I also used objdump -h to compare the binaries themselves. Here is the result. Ignore the VM address changes; only the sizes matter.

Another question: the sum of the Mach-O section size diffs is 1.71 MB, but the binary size diff is 2.05 MB, and I don't know where the missing 340 KB comes from. Is there any good practice for comparing the Mach-O binary size layout?

This is interesting. I’ll have to check to see if our apps show similar regressions, but I suspect not, because I don’t remember large uncompressed __text size regressions.

I see that the __unwind_info section shrank significantly, and I also observed this. In fact, I've locally added new hashes representing each function's unwind info type to further improve the uncompressed __unwind_info size. When I get the time, I want to submit a PR for this, so I'm glad to see you observe this too.

Bloaty is a nice tool to compare binaries.

I'm not exactly sure what you mean, though. Are you saying that the accumulated diff of the sections (1.71MB) does not match the diff of the binaries (2.05MB)?

I’ll have to check to see if our apps show similar regressions

These branch islands only occur when the __TEXT size is larger than 128 MB (some b/bl instructions then have to jump across more than 128 MB of address space); maybe the apps you tested don't face this issue.

Are you saying that the accumulated diff of the sections (1.71MB) does not match the diff of the binaries (2.05MB)?

Yes. I found that the extra binary size seems to come from the __LINKEDIT segment (the binding info), which llvm-objdump does not show.
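For anyone who wants to account for the full file, a minimal sketch that walks the load commands of a thin, little-endian, 64-bit Mach-O and prints each segment's on-disk size — including __LINKEDIT, which llvm-objdump -h omits because it contains no sections. No fat/universal handling:

```cpp
#include <mach-o/loader.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <macho>\n", argv[0]);
    return 1;
  }
  std::FILE *F = std::fopen(argv[1], "rb");
  if (!F)
    return 1;
  std::fseek(F, 0, SEEK_END);
  long Size = std::ftell(F);
  std::fseek(F, 0, SEEK_SET);
  std::vector<uint8_t> Buf(Size);
  (void)std::fread(Buf.data(), 1, Buf.size(), F);
  std::fclose(F);

  auto *Header = reinterpret_cast<const mach_header_64 *>(Buf.data());
  if (Buf.size() < sizeof(mach_header_64) || Header->magic != MH_MAGIC_64) {
    std::fprintf(stderr, "not a thin 64-bit Mach-O\n");
    return 1;
  }
  // Walk the load commands and print each segment's file size. Summing these
  // accounts for the parts of the binary that section-level diffs miss.
  const uint8_t *Cmd = Buf.data() + sizeof(mach_header_64);
  for (uint32_t I = 0; I < Header->ncmds; ++I) {
    auto *LC = reinterpret_cast<const load_command *>(Cmd);
    if (LC->cmd == LC_SEGMENT_64) {
      auto *Seg = reinterpret_cast<const segment_command_64 *>(LC);
      std::printf("%-16.16s filesize=%llu\n", Seg->segname,
                  (unsigned long long)Seg->filesize);
    }
    Cmd += LC->cmdsize;
  }
}
```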