[RFC] Adding Matching and Inference Functionality to Propeller

Hi folks,
We would like to present a proposal to add matching and inference functionality to Propeller, a link-time optimization technique, and we are seeking your valuable feedback.


Authors: Dexin Wu (wudexin@kuaishou.com), Fengxiang Li (lifengxiang@kuaishou.com), Minghui Wu (wuminghui03@kuaishou.com)

Summary

Propeller is a link-time optimization technique that offers optimization effects comparable to BOLT, with lower memory usage and shorter execution time. The current version of Propeller has zero tolerance for changes in build parameters and source code, so using a stale profile for Propeller optimization can lead to negative optimization. Additionally, using the same profile for both AutoFDO optimization and Propeller optimization can also result in poor or even negative optimization effects. These issues limit the practical use of Propeller in production environments.

To address this problem, we referred to the matching and inference functionality in BOLT and plan to introduce it into Propeller. In our experiment, we used the same profile file for both AutoFDO and Propeller optimization. The results show that with matching and inference, the Propeller optimization no longer has a negative impact on the AutoFDO optimization; instead, it brings an additional benefit of 2-3%.

Problem statement

In Propeller, “hot” Machine Basic Blocks (MBBs) are extracted from functions based on sampled execution frequencies to form clusters. In the generated cluster configuration file, MBBs are identified by Basic Block IDs (BBIDs). However, BBIDs are extremely unstable: minor code changes or variations in build parameters can change them significantly. If the original BBIDs are still used to identify clusters, “hot” MBBs are identified incorrectly, resulting in negative optimization. This is why Propeller has zero tolerance for changes in build parameters and source code.

Approach

Inspired by the way BOLT handles stale profile files, we use the hash values of MBBs to identify them in the profile file. The MBB hash is computed the same way as in BOLT and consists of {Offset, OpcodeHash, InstrHash, NeighborHash}. The profile contains the hash values of MBBs together with their sampled frequencies. When using the profile file, we match the hashes in the profile against the hashes of the MBBs. If a match succeeds, we set the MBB’s frequency to the frequency recorded in the profile. After matching is completed, we use an inference algorithm to estimate the frequencies of the remaining MBBs. Finally, we select the “hot” MBBs based on the complete MBB frequencies.

The hash value of an MBB is more stable than its BBID and is less affected by changes in build parameters and source code, and the inference algorithm further enhances this stability. Introducing matching and inference can therefore significantly improve Propeller’s tolerance for changes in build parameters and source code, making it easier to use in practical production environments. The specific changes include the following points:

  1. Add MBB hashes to the BBAddrMap section: The hash includes four parts: {Offset, OpcodeHash, InstrHash, NeighborHash}.
  • Offset: The offset of the MBB within the function.
  • OpcodeHash: The hash of the instruction opcodes in the MBB (including only opcodes, not operands).
  • InstrHash: The hash of the instructions in the MBB (including both opcodes and operands).
  • NeighborHash: The OpcodeHash of the direct predecessors and successors of the MBB in the Control Flow Graph (CFG).

In the LLVM CodeGen part, add a MachineBlockHashInfo Pass to calculate the hash value of each MBB in every MachineFunction. In the AsmPrinter, write the hash value of the MBB to the BBAddrMap.
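As an illustration of how the four hash components might fit together, here is a toy Python sketch. This is not the actual LLVM/BOLT implementation: the FNV-style mixing, the `Block` shape, and all names are invented for illustration only.

```python
from dataclasses import dataclass, field

def fnv1a(data: bytes) -> int:
    """Deterministic 64-bit FNV-1a hash, standing in for BOLT's hashing."""
    h = 0xcbf29ce484222325
    for b in data:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

@dataclass
class Block:
    offset: int                        # offset of the MBB within its function
    insts: list                        # list of (opcode, operands) pairs
    preds: list = field(default_factory=list)
    succs: list = field(default_factory=list)

def opcode_hash(b: Block) -> int:
    # opcodes only, operands ignored
    return fnv1a(" ".join(op for op, _ in b.insts).encode())

def instr_hash(b: Block) -> int:
    # both opcodes and operands
    return fnv1a(";".join(f"{op} {','.join(ops)}" for op, ops in b.insts).encode())

def neighbor_hash(b: Block) -> int:
    # combine the OpcodeHash of direct CFG predecessors and successors
    parts = b"".join(opcode_hash(n).to_bytes(8, "little") for n in b.preds + b.succs)
    return fnv1a(parts)

def block_hash(b: Block):
    # the four-part hash {Offset, OpcodeHash, InstrHash, NeighborHash}
    return (b.offset, opcode_hash(b), instr_hash(b), neighbor_hash(b))
```

The key property the design relies on is visible here: renaming a register or changing a constant perturbs InstrHash but leaves OpcodeHash intact, which is what makes fuzzy matching possible.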

  2. Add a new Propeller profile file format: The format is as follows:
!foo
!!0x1dce0b85df8f0000 2 0
!!0x28d413540c67002d 1 1
!!0xd7da78be7050003e 1 3
!!0x4e970dad93b80057 0 2

Lines starting with “!” represent the function name. In lines starting with “!!”, the first element is the hash value of the MBB, the second element is the sampled frequency of the MBB, and the third element is the BBID.

In CodeGen, add a FuncHotBBHashesProfileReader Pass to read the new propeller profile format and store the mapping information between the parsed MBB hashes and frequencies.
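A minimal sketch of reading this format could look like the following (a hypothetical stand-alone parser for illustration, not the FuncHotBBHashesProfileReader pass itself):

```python
def parse_propeller_profile(text):
    """Parse the hash-based profile: a '!name' line starts a function,
    and each '!!<hash> <freq> <bbid>' line records one basic block."""
    profiles = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("!!"):           # check '!!' before '!'
            h, freq, bbid = line[2:].split()
            profiles[current].append((int(h, 16), int(freq), int(bbid)))
        elif line.startswith("!"):
            current = line[1:]
            profiles[current] = []
    return profiles
```

Note the `!!` test must precede the `!` test, since every block line also starts with `!`.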

  3. Add a new HotMachineBasicBlockInfoGenerator Pass: This pass runs before clusters are generated for a MachineFunction. It first retrieves the mapping between hash values and frequencies from the FuncHotBBHashesProfileReader and matches those hashes against the MBBs in the current MachineFunction. If a match succeeds, it records the frequency of the matched MBB as the corresponding frequency in the profile. The matching method is the same as in BOLT, and the matching principles are as follows:
  • (a) The OpcodeHash must be the same for a match.
  • (b) From the MBBs with the same OpcodeHash, select the MBBs with the closest distance based on InstrHash and NeighborHash.
  • (c) If multiple MBBs are selected in (b), choose the one with the smallest offset.

After the matching is completed, call the inference algorithm to estimate the frequency information of all MBBs in the MachineFunction. Select the entry MBB and MBBs with a frequency greater than zero as “hot” MBBs.
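The three matching rules above can be sketched as follows. This is an illustrative stand-in, not BOLT's actual code: the dict-based block representation is invented, and a popcount-of-XOR (Hamming) distance is used as a stand-in for BOLT's similarity metric on InstrHash and NeighborHash.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")

def match_block(rec, candidates):
    """rec and each candidate are dicts with integer fields
    'offset', 'opcode', 'instr', 'neighbor'."""
    # (a) OpcodeHash must be identical for a match
    pool = [c for c in candidates if c["opcode"] == rec["opcode"]]
    if not pool:
        return None
    # (b) among those, prefer the closest InstrHash + NeighborHash
    def dist(c):
        return (hamming(c["instr"], rec["instr"])
                + hamming(c["neighbor"], rec["neighbor"]))
    best = min(dist(c) for c in pool)
    pool = [c for c in pool if dist(c) == best]
    # (c) break remaining ties by the smallest offset
    return min(pool, key=lambda c: c["offset"])
```

Rule (a) acts as a hard filter, while (b) and (c) only rank the survivors, which mirrors the ordering of the principles listed above.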

  4. Create clusters for “hot” MBBs in BasicBlockSections: If the new Propeller profile format is used, create a cluster for the “hot” MBBs identified in step 3.

How to use matching and inference in propeller?

  1. Build the base binary: Use the -fbasic-block-address-map option to build the base version of the binary. The generated BBAddrMap section will include the hashes of basic blocks.
$ clang++ -O2 -gline-tables-only -fdebug-info-for-profiling -funique-internal-linkage-names -fbasic-block-address-map code.cc -o base_code
  2. Generate the profile: Run the base binary and use the perf tool to collect samples, obtaining perf.data. Use create_llvm_prof to convert it into a profile, passing --use_bb_hash during the conversion to generate the new Propeller format.
$ perf record -b ./base_code
$ create_llvm_prof --binary=./base_code --format=propeller --use_bb_hash --profile=perf.data --out=code.propeller
  3. Recompile for optimization: Recompile to generate the optimized binary. During compilation, use -fpropeller-profile-use to pass in the new Propeller profile, and use -mllvm,-propeller-match-infer to enable the matching and inference functionality. At link time, use -Wl,--propeller-profile-use to pass in the Propeller profile.
$ clang++ -O2 -gline-tables-only -fdebug-info-for-profiling -funique-internal-linkage-names -fpropeller-profile-use=code.propeller -mllvm,-propeller-match-infer -Wl,--no-warn-symbol-ordering -Wl,--propeller-profile-use=code.propeller -fuse-ld=lld code.cc -o optimized_code

Experimental data

In the experiment, we used clang to compile itself to evaluate the effectiveness of Propeller optimization with the matching and inference function. We denote the unoptimized clang as base_clang. Since there were significant differences in the MachineFunctions between base_clang and the clang optimized by AutoFDO, directly using the profile sampled from base_clang for both AutoFDO and Propeller optimization would result in a low matching rate, thereby affecting the final outcome.

Therefore, in the actual experiment, we first sampled base_clang and applied AutoFDO optimization to obtain the optimized binary base_autofdo_clang. Next, we sampled base_autofdo_clang to get a profile. Using this profile, we performed AutoFDO optimization to obtain autofdo_clang, conducted both AutoFDO and the original propeller optimization to get autofdo_ori_propeller_clang, and carried out AutoFDO and the propeller optimization with matching and inference to obtain autofdo_my_propeller_clang.

We measured the time taken by base_clang, autofdo_clang, autofdo_ori_propeller_clang, and autofdo_my_propeller_clang to compile clang. The results are shown in the following table:

| Version | Time | Optimization Ratio |
|---|---|---|
| base_clang | 355.3s | - |
| autofdo_clang | 308.4s | 13.2% |
| autofdo_ori_propeller_clang | 334.5s | 5.9% |
| autofdo_my_propeller_clang | 299.8s | 15.6% |

From the results, we can observe that when using the same profile for both AutoFDO and the original propeller optimizations, the propeller optimization had a negative impact on the AutoFDO optimization. The performance of the optimized clang significantly declined compared to that of the clang optimized only by AutoFDO. However, when using the same profile for AutoFDO and the propeller optimization with the matching and inference function, the propeller optimization did not degrade the effect of AutoFDO optimization. Instead, it brought an additional 2.4% performance improvement.

In addition, we also conducted verification on several C++ services in our company. When using a single profile for AutoFDO and the original propeller optimizations, the propeller optimization always degraded the effect of AutoFDO. In contrast, when using the same profile for AutoFDO and the propeller optimization with the matching and inference function, the propeller optimization on average brought an additional 2% performance improvement.


We have added a patch to the LLVM project to incorporate the relevant code: Adding Matching and Inference Functionality to Propeller by wdx727 · Pull Request #139008 · llvm/llvm-project · GitHub. Everyone is welcome to review it.

It’s great to see the matching/inference technology being adopted in more use cases, from CSSPGO to BOLT and Propeller! Looking forward to reviewing your diffs. Hopefully there is enough overlap between the implementations that we can re-use some components.

> When using a single profile for AutoFDO and the original propeller optimizations, the propeller optimization always degraded the effect of AutoFDO. In contrast, when using the same profile for AutoFDO and the propeller optimization with the matching and inference function, the propeller optimization on average brought an additional 2% performance improvement.

I am not an expert in how exactly Propeller treats profile data, but the results make me suspect that there is a gap somewhere (or equivalently, a further opportunity). Propeller could apply optimization only to the code with known valid profiles and keep everything as is; that should probably avoid perf regressions, even if profile data is stale. This is, of course, orthogonal to the RFC.

Thanks for your reply. Propeller could indeed add more information to the profile to check for invalidation, thereby avoiding performance degradation, but the current profile format does not carry enough information to achieve this. In our solution, when the hashes of the MBBs (Machine Basic Blocks) in a MachineFunction have a very low matching rate against the hashes recorded in the profile, we give up optimizing that MachineFunction, which effectively prevents degradation.
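That bail-out policy could look roughly like the following sketch (the helper name and the 50% threshold are purely illustrative; the actual cutoff is not stated in the thread):

```python
def should_optimize(num_matched: int, num_blocks: int,
                    threshold: float = 0.5) -> bool:
    """Skip a MachineFunction whose blocks mostly failed to match.

    The 0.5 threshold is an illustrative placeholder, not the real cutoff.
    """
    if num_blocks == 0:
        return False  # nothing to optimize
    return num_matched / num_blocks >= threshold
```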

> I am not an expert in how exactly Propeller treats profile data, but the results make me suspect that there is a gap somewhere (or equivalently, a further opportunity). Propeller could apply optimization only to the code with known valid profiles and keep everything as is;

This happens today and Propeller piggy-backs on the instrumented FDO (iFDO) hash matching. When iFDO detects that there is a hash mismatch when applying the iFDO profile, Propeller also drops the function. This works as Propeller is most effective when applied over some form of FDO.

Thanks for sharing the proposal. A few questions regarding the experimental methodology.

  1. The proposal offers tolerance towards source code changes, but the experimental setup has no mention of source drift between base and AutoFDO binaries. Is this correct? More generally, what is the expected amount of source and build parameter skew in your production environment? Are you dealing with occasional cherrypicks, or your production releases have longer lapses which introduce much larger source and profile skews?

    A simple way of introducing source drift is to collect the profile from a binary built at an older githash. A more advanced and meaningful skew experiment involves changing hot functions and making sure that the profile is properly recovered. Overall, if source drift is an important goal, additional data points will make the RFC much stronger and aligned with the scope.

  2. One of Propeller’s major optimizations is function splitting (moving cold parts of hot functions to .text.split). If the program uses huge-pages, the precision of this optimization becomes very critical. Empirically, moving a hot block to .text.split has a high cost due to the limited number of hugepage entries in the TLB. Therefore, the splitting strategy needs to be conservative and keep blocks in .text.hot even with a slight chance of hotness. Have you tried this approach on larger (more than 100MBs of text) binaries with hugepages?

  3. Finally, given that using the same profile for AFDO and Propeller is a goal of this proposal, you might be interested in trying out (and possibly comparing your approach against) Propeller-like optimizations via FSAFDO (flow-sensitive AutoFDO). This is currently adopted for AFDO workloads at Google and gives about 1% improvement above AFDO.

    FSAFDO is enabled via -mllvm,-enable-fs-discriminators=true and -mllvm,-improved-fs-discriminator=true and Propeller optimizations are separately enabled via
    -mllvm=-enable-ext-tsp-block-placement=true, -fsplit-machine-functions within the regular compilation pipeline with the same AFDO profile (no need to invoke propeller profile generation). You can also add -mllvm=-salvage-stale-profile=true for improved profile fuzzy matching.
    Instructions for basic FSAFDO can be found in [llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO). Please note that the FSAFDO profile is collected directly from an optimized binary (aka iterative compilation). So to simulate this you would need to collect at least two rounds of AFDO profile to ensure its effectiveness. However, this is not necessary in a production environment with timely releases.

Finally, we would be happy to evaluate this on our internal workloads as well if you’re willing to share your patches.


Thanks. We’ll run experiments on source drift and share the results.

Our internal workloads have more than 100MB of text with hugepages, and we have not observed any performance degradation so far. Perhaps we can disable function splitting when we find that a function does not match strictly.

For steps 3 and 4, have you benchmarked how much time this matching, inference, and clustering would add?

We conducted experiments in which Clang was used to compile Clang, aiming to verify the impact of stale profiles from different points in time on Propeller’s optimization performance. The results are illustrated in the figure below.

For the original Propeller, the optimization efficacy deteriorates rapidly over time. Specifically, when employing a profile from a year ago, it essentially offers no optimization benefits. In stark contrast, the Propeller enhanced with the matching and inference functionality demonstrates significantly greater resilience. Its optimization performance decays much more gradually, maintaining approximately 8% effectiveness even when utilizing a profile that is a year old.

These findings strongly validate that integrating the matching and inference features substantially improves Propeller’s tolerance to expired profiles, ensuring more consistent optimization performance despite the age of the profiles.

We used Clang to compile Clang and measured the compilation times of the original Propeller and the enhanced version with the matching and inference functionality. The results show that the original Propeller took 356.012 seconds, while the version with matching and inference required 363.624 seconds—a mere 2% increase. This indicates that the matching and inference functionality has a negligible impact on the overall execution time of the optimization process.

Thanks for doing the experiment. The results look very promising. What is the baseline flavor? Are you using ThinLTO and any form of PGO (AFDO or iFDO) before Propeller?

The baseline is normal clang compiling clang. The blue curve is clang optimized by original Propeller with outdated profile compiling clang. The orange curve is clang optimized by Propeller with outdated profile and enabling matching and inference compiling clang. No LTO and PGO here.

Due to the large number of files modified in the previous PR, we plan to split the PR into three parts for separate submission:

Patch 1: Add a pass to calculate the hash value of MachineBlock and write the hash into the bb_address_map section. Related test cases and tools will also be modified.
Patch 2: Introduce a new Propeller profile data format and add a pass to read the profile, storing the mapping between MachineBlock hash values and execution frequencies.
Patch 3: Introduce Matching and Inference techniques to obtain the execution frequencies of all MachineBlocks, generate cluster information for MachineFunction, and use the profile in lld to generate symbol sorting.

Patch 1 is ready. Everyone is welcome to review it. Adding Matching and Inference Functionality to Propeller-Patch 1 by wdx727 · Pull Request #140886 · llvm/llvm-project · GitHub

If we go forward with the ideas in this proposal, then we are forced to defer all layout decisions to the compiler, as we cannot make any decisions in the offline tool. This opens up two different ways in which Propeller optimizes binaries and might pose a lot of maintenance challenges. In other words,

  1. Stale profile matching enabled implies Propeller’s offline tool passes through the profiles of the MBBs. The compiler matches them and then does layout and splitting decisions.
  2. No Stale profile implies the current behaviour. The offline tool makes the layout and splitting decisions and passes these to the compiler.

This makes the Propeller profile format very complicated and hard to maintain going forward. I am fine with making this converge to something where it just becomes a pass-through and the compiler makes all the decisions. However, this needs to be discussed more broadly as certain optimizations can only be done by the offline tool which can also access the binary that produced the profiles. Does that make sense?

Thank you very much for your response. It is indeed understandable that Propeller’s dual approaches to binary optimization could increase maintenance complexities. I am curious about which specific optimizations in Propeller must necessarily be handled by the offline tool. Perhaps if we design an appropriate profile format to precompute and store all required information offline within the profile, all optimization and layout decisions could potentially be entrusted to the compiler for execution.

After careful consideration, I think that once we infer the weights of all MBBs (Machine Basic Blocks) and edges, both cluster partitioning and path cloning optimizations can be performed within the compiler. If this holds true, our current profile format should already be sufficient to support all Propeller optimizations in the compiler. I’d like to ask if there are any aspects I haven’t considered in this approach.

Let’s not gate this on the sufficiency of the edge profile (It may work for layout, but edge profile is fundamentally incapable of path cloning optimizations). Besides, we’re working on other kinds of optimizations that are purely directive-based and are not based on the edge profile.

For the context of this RFC, I am fine with supporting the edge profile in Propeller. Other directives (layout and cloning) should still be supported and functional.

Thank you for your response. Is the edge profile you mentioned the same as the profile format we are currently designing, or do you expect us to make further modifications?

You can see the edge profile format in llvm-propeller/propeller/testdata/sample_verbose_cc_directives.golden.txt at main · google/llvm-propeller · GitHub

We should probably add documentation for this in llvm/docs/Extensions.rst.

For example, in

#cfg 0:260824,1:102034,2:147684 1:104380,3:104380

the block ID #0 has a total frequency of 260824, with edge frequencies of 102034 for 0->1 and 147684 for 0->2.
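Judging from that example, each whitespace-separated group appears to be a `block:total` head followed by `succ:freq` pairs. A hedged sketch of a parser for one such line (the format is inferred from the single example in the thread, not from the spec):

```python
def parse_cfg_directive(line: str):
    """Parse one '#cfg' line into {block_id: (total_freq, {succ_id: edge_freq})}."""
    body = line.split(None, 1)[1]          # drop the '#cfg' keyword
    out = {}
    for group in body.split():
        head, *edges = group.split(",")
        bb, total = (int(x) for x in head.split(":"))
        out[bb] = (total, {int(s): int(f)
                           for s, f in (e.split(":") for e in edges)})
    return out
```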