RFC - Extending the PGO analysis map with Propeller CFG frequencies

At Google we often use profile analysis and comparison tools to identify improvement opportunities such as fixing compiler bugs and alignment of the load tests with production. Comparing raw profiles against the production profile is challenging as the mapping between binary addresses and IR is obscure. Fortunately, the PGO analysis map solves this problem by emitting PGO-related block and edge frequencies from BFI and BPI analyses into the binary, and mapping them to binary addresses of basic blocks.

Although this works for binaries built with iFDO and AFDO, the information is insufficient for Propeller-optimized binaries. Propeller performs code layout and cloning optimizations based on a second round of more precise, context-sensitive profile collected on top of PGO(iFDO or AFDO)-optimized binaries. Therefore, access to the profile used by Propeller is crucial to analyze and verify such optimizations.

We propose extending the PGO analysis map data to include the CFG profile from Propeller (via a flag) in addition to the basic PGO data. Unlike the BFI and BPI which include information for all blocks (with or without profiles), the Propeller CFG only contains the blocks with Propeller profiles. So we can efficiently emit the Propeller CFG within the current PGO analysis map structure.

We add a feature bit for every function’s PGO analysis map. This feature bit indicates whether that function includes a Propeller analysis map. If so, every block will include the block frequency (which could be different from the PGO data because of context-sensitivity). If this is nonzero, the structure will also contain the edge frequencies for each successor of that block. This needs minor modifications to the existing encoding and decoding code.

The blob shown below shows the Propeller-related data emitted into the SHT_LLVM_BB_ADDR_MAP section (lines prefixed by ***)

  .section  ".llvm_bb_addr_map","",@llvm_bb_addr_map
  .byte     4               # version number
  .byte     135             # feature byte - PGO+Propeller enabled
  .quad     .Lfunc_begin0   # address of the function
  .uleb128  4               # number of basic blocks
# BB record for BB_0
  .uleb128  0               # BB_0 BB ID
  ...
# BB record for BB_1
  .uleb128  1               # BB_1 BB ID
  ...
# BB record for BB_2
  .uleb128  2               # BB_2 BB ID
  ...
# BB record for BB_3
  .uleb128  3               # BB_3 BB ID
  ...

# PGO Analysis Map

  .uleb128  1000             # function entry count

# PGO data record for BB_0
  .uleb128  1000             # BB_0 basic block frequency
  .uleb128  100              # BB_0 block frequency from Propeller
  .uleb128  3                # BB_0 successors count
  .uleb128  1                # BB_0 successor 1 BB ID

  .uleb128  0x22222222       # BB_0 successor 1 branch probability
 ***.uleb128  60             # BB_0 successor 1 edge frequency from Propeller
  .uleb128  2                # BB_0 successor 2 BB ID
  .uleb128  0x33333333       # BB_0 successor 2 branch probability
 ***.uleb128  0              # BB_0 successor 2 edge frequency from Propeller
  .uleb128  3                # BB_0 successor 3 BB ID
  .uleb128  0xaaaaaaaa       # BB_0 successor 3 branch probability
 ***.uleb128  40             # BB_0 successor 3 edge frequency from Propeller

# PGO data record for BB_1
  .uleb128  133              # BB_1 basic block frequency
  .uleb128 60                # BB_1 block frequency from Propeller
  .uleb128  2                # BB_1 successors count
  .uleb128  2                # BB_1 successor 1 BB ID

  .uleb128  0x11111111       # BB_1 successor 1 branch probability
 ***.uleb128  0              # BB_1 successor 1 edge frequency from Propeller
  .uleb128  3                # BB_1 successor 2 BB ID 
  .uleb128  0x11111111       # BB_1 successor 2 branch probability
 ***.uleb128  60             # BB_1 successor 2 edge frequency from Propeller

# PGO data record for BB_2
  .uleb128  18               # BB_2 basic block frequency
 ***.uleb128  0              # BB_2 block frequency from Propeller
  .uleb128  1                # BB_2 successors count
  .uleb128  3                # BB_2 successor 1 BB ID
  .uleb128  0xffffffff       # BB_2 successor 1 branch probability
 ***## No Propeller edge frequency for this block (zero block frequency)

# PGO data record for BB_3
  .uleb128  1000             # BB_3 basic block frequency
 ***.uleb128  100            # BB_0 block frequency from Propeller*** 
  .uleb128  0                # BB_3 successors count (only enabled with branch probabilities)*

Usability

The new extension can be used by similar profiling frameworks such as Bolt. Bolt already processes hardware profiles and maps them to basic blocks. However, Bolt’s binary writing goes through its own BinaryEmitter rather than AsmPrinter. So some work is probably required to adopt the PGO analysis map feature.

Alternative Considered

We can implement a customized solution in our build system by embedding the CFG profile in the binary. The build system can read the Propeller profile and use a custom encoding scheme to emit it into the binary. This solution provides independence from the compiler but has several disadvantages:

  1. Does not provide the feature for object files (or it’s harder to implement it).

  2. Misses out on the benefit from LLVM tools such as llvm-readobj or llvm-objdump (which are already capable of emitting the PGO analysis map data) as they will be unable to decode the data.

  3. Redundancy: Basic block IDs and their successor are already stored in the PGO analysis map. This solution would have to store them separately.

  4. Inability to access inferred profiles: Although Propeller still uses the profiles directly as read from the profile, there are ongoing efforts towards profile matching and inference in order to tolerate CFG differences. This means the CFG profile could be inferred and mapped in the compiler, similar to BFI. When this happens, the raw CFG profile may not be relevant and we must emit the profile after it’s been mapped to basic blocks.

1 Like

I think this makes sense to implement. Another alternative that we discussed offline was adding a flag to switch whether we’re emitting information from the BFI/BPI analyses or from the Propeller profile. That would have the advantage of not requiring any new encoding/decoding support and not using up an additional feature bit. However, we lose a lot of convenience if we want both PGO and Propeller data, which would be useful in a lot of scenarios where PGOAnalysisMap is currently used. We would need to do two builds which can be pretty expensive, especially since they would be completely identical minus the profile data that gets embedded in the binary. Being able to easily compare the two profiles would be very useful for checking profile fidelity.