[RFC] Control Flow Sensitive AutoFDO (FS-AFDO)

Hi all,

Here I include an RFC for control flow sensitive AutoFDO (FS-AFDO). This is joint work with David Li. Questions and feedback are welcome.

Thanks,

Rong

Hi Rong,

This is a very interesting proposal. We’ve also observed profile quality degradation from CFG-destructive passes like loop rotate, and I can see how this proposal would help improve the quality of the profile that drives later optimization passes in the pipeline. I have a few questions.

  • How does this affect today’s AutoFDO? Specifically, can users upgrade the compiler with FS-AutoFDO support without refreshing their profile? I think it’s important to make new improvements like this opt-in, so other AutoFDO users can choose if and when to make the switch from AutoFDO to FS-AutoFDO. With the proposed changes to discriminator encoding, it sounds like we are going to eliminate the duplication factor etc. altogether. In that case, multiple FS-AutoFDO sample profile loadings would be required in order not to regress from today’s AutoFDO due to the lack of a duplication factor. Is that correct? If so, this change as is can break backward compatibility with today’s AutoFDO profiles. It’d be great to handle discriminators in a way that is compatible with today’s AutoFDO.

  • Profile loading usually happens early in the pipeline in order to avoid mismatches between profiling and optimization. In the case of FS-AutoFDO, some mismatch may be inevitable for the later FS profile loading. In practice, how significant is the mismatch in the later profile loading? Have you tried to see by how much using multiple iterations (like CSPGO) can further help profile quality and improve performance? I understand this is not practical for production use, but it could give us data points as to what’s left on the table due to the mismatch of later FS profile loading.

  • In your performance experiment, where is the extra FSProfileSampleLoader in the pipeline: right before machine block placement, or somewhere else? Have you tried adding more than one FSProfileSampleLoader, and is there an extra perf gain from more than one?

  • For the final discriminator assignment, is this because we take the MAX of samples on addresses from the same location? So if there’s any late code duplication after the last discriminator assignment, we need a final discriminator assignment to turn MAX into SUM for the earlier FS profile?

  • While changing the discriminator encoding, have you considered encoding a block ID into the discriminator so AutoFDO can be CFG-aware? This is something Wei brought up earlier in the context of CSSPGO; we didn’t go that route partly because it’s hard to do without breaking compatibility.

This would be complementary to the context-sensitive SPGO we proposed earlier; it would be nice to make PGO/FDO both context-sensitive and flow-sensitive. I think the flow-sensitive part could also integrate with pseudo-probes: essentially we can append FS discriminators later, multiple times, the same way as you proposed here, except that the “line” part would be the “probeId”.

+Hongtao as well.

Thanks,

Wenlei

Hi Rong,

Nice to see this proposal, which is quite interesting and useful. I believe it can help address some of the performance-critical issues we have encountered with our AutoFDO workloads. As Wenlei mentioned, we have been dealing with profile count degradation caused by various transformations, such as loop rotate, SimplifyCFG, etc. Relying on BPI/BFI to infer a reasonable count distribution for duplicated code is sometimes challenging. A per-pass profile will make it easier.

One thing I don’t quite get is how a flow-sensitive profile is applied to a specific optimization pass. If I understand correctly, the records of duplicated instructions collected for an optimization pass from the training build are mapped to the corresponding instructions produced by the same pass in the optimized build. Can you explain a bit more about how the mapping works and how code duplication is tracked? Could a slightly different IR duplication cause an incorrect mapping?

It looks like the ID of a pass is considered when adding discriminators for that pass. I guess we are not aiming to support every pass with an FS profile, given the size limitation of the 32-bit DWARF discriminator. What are the first-class passes being supported? Would you consider extending the discriminator size for new optimizations?

Thanks,

Hongtao

Hi Wenlei, thanks for the comments and questions! Please see my comments inline.

Hi Rong,

This is a very interesting proposal. We’ve also observed profile quality degradation from CFG-destructive passes like loop rotate, and I can see how this proposal would help improve the quality of the profile that drives later optimization passes in the pipeline. I have a few questions.

  • How does this affect today’s AutoFDO? Specifically, can users upgrade the compiler with FS-AutoFDO support without refreshing their profile? I think it’s important to make new improvements like this opt-in, so other AutoFDO users can choose if and when to make the switch from AutoFDO to FS-AutoFDO. With the proposed changes to discriminator encoding, it sounds like we are going to eliminate the duplication factor etc. altogether. In that case, multiple FS-AutoFDO sample profile loadings would be required in order not to regress from today’s AutoFDO due to the lack of a duplication factor. Is that correct? If so, this change as is can break backward compatibility with today’s AutoFDO profiles. It’d be great to handle discriminators in a way that is compatible with today’s AutoFDO.

FS-AFDO has a new discriminator encoding scheme. You are right that it’s not backward compatible with current AFDO profiles. The conflict is mainly from the duplication factor, as it will be completely replaced.
The only possible regression I can think of is using a current AFDO profile as an FS-AFDO profile: due to the difference in encoding, some discriminators might be handled incorrectly.
I think we can add a flag to the AFDO header to let the compiler know whether this is the new FS-AFDO profile.

We did have some discussions on the deployment at Google. It will require two stages: the first stage enables the new discriminator scheme but still reads the current AFDO profile. In the second stage, we will use the FS-AFDO profile and enable the FS-AFDO passes.
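To make the header-flag idea concrete, here is a minimal sketch of how a loader could gate its interpretation of the discriminator field on such a flag. Every name and bit layout below is made up for illustration; this is not actual LLVM or AutoFDO code, and the legacy field widths are assumptions.

```python
# Hypothetical sketch: gate discriminator interpretation on a profile
# header flag, so one compiler binary can read both profile kinds.
# All names and bit layouts are illustrative, not actual LLVM code.

FS_AFDO_FLAG = 0x1  # assumed header bit marking an FS-AFDO profile

def interpret_discriminator(header_flags: int, discriminator: int):
    """Return (base_discriminator, duplication_factor) under the
    encoding selected by the profile header."""
    if header_flags & FS_AFDO_FLAG:
        # FS-AFDO: the whole field carries flow-sensitive bits; the
        # old duplication factor no longer exists (treated as 1).
        return discriminator, 1
    # Legacy-style packing (field widths made up for this sketch):
    # low 8 bits = base discriminator, next 8 bits = dup factor.
    base = discriminator & 0xFF
    dup_factor = (discriminator >> 8) & 0xFF or 1
    return base, dup_factor
```

The point of the sketch is only that the flag lets a single reader distinguish the two encodings, which is what makes the two-stage deployment safe.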

  • Profile loading usually happens early in the pipeline in order to avoid mismatches between profiling and optimization. In the case of FS-AutoFDO, some mismatch may be inevitable for the later FS profile loading. In practice, how significant is the mismatch in the later profile loading? Have you tried to see by how much using multiple iterations (like CSPGO) can further help profile quality and improve performance? I understand this is not practical for production use, but it could give us data points as to what’s left on the table due to the mismatch of later FS profile loading.

I did measure the coverage (the same mechanism used in SampleLoader). For our non-trivial programs, we can use about 74% of the samples (for the FS-AFDO loader before machine block placement).
I also tried iterative FS-AFDO. Unfortunately, iterative AFDO alone shows a non-trivial regression which is larger than the gain from FS-AFDO. I did that experiment a while ago; I might need to revisit it later.

  • In your performance experiment, where is the extra FSProfileSampleLoader in the pipeline: right before machine block placement, or somewhere else? Have you tried adding more than one FSProfileSampleLoader, and is there an extra perf gain from more than one?

The main performance gain is from the FSProfileSampleLoader before machine block placement. I did add one more loader after block placement (and a new BranchFolder pass to do the tail merging), but that did not give a visible performance gain.
That said, I think there is still a lot of tuning work to do for better performance. I will definitely try more experimental loader passes.

  • For the final discriminator assignment, is this because we take the MAX of samples on addresses from the same location? So if there’s any late code duplication after the last discriminator assignment, we need a final discriminator assignment to turn MAX into SUM for the earlier FS profile?

Yes, that’s the main consideration. create_llvm_profile still uses MAX for the same locations. The final discriminator assignment assigns a different discriminator to each piece of duplicated code so that their samples can be summed.
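To illustrate the MAX-vs-SUM issue with a toy model (this is not the actual create_llvm_profile logic): samples landing at different addresses that map to the same (line, discriminator) key are merged with MAX, so clones produced by late duplication lose counts unless each clone gets its own discriminator.

```python
# Toy model of sample profile aggregation. A sample record is
# (line, discriminator, count); records at the same key merge via MAX,
# since the same source location sampled at several addresses is
# normally the same dynamic count observed multiple times.

def aggregate(samples):
    merged = {}
    for line, disc, count in samples:
        key = (line, disc)
        merged[key] = max(merged.get(key, 0), count)
    return merged

# Late tail duplication produced two copies of line 10, executed 60
# and 40 times. If both clones still share discriminator 0, MAX keeps
# only 60 of the 100 total executions:
shared = aggregate([(10, 0, 60), (10, 0, 40)])
# A final discriminator assignment gives each clone its own value, so
# both counts survive as separate records and can be summed back:
distinct = aggregate([(10, 1, 60), (10, 2, 40)])
total = sum(distinct.values())
```

In the `shared` case the 40-count clone is silently dropped; in the `distinct` case the consumer can recover the full count of 100 by summing.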

  • While changing the discriminator encoding, have you considered encoding a block ID into the discriminator so AutoFDO can be CFG-aware? This is something Wei brought up earlier in the context of CSSPGO; we didn’t go that route partly because it’s hard to do without breaking compatibility.

The discriminator only has 32 bits, so it’s hard to inject more information. In the early stages of this work, we did consider using BB labels to encode more of the flow information. This would build on Sri’s work on Basic Block Sections, but it would require more significant changes than the discriminator-based method.
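As a rough illustration of that 32-bit space pressure, here is a sketch of packing per-pass flow-sensitive bits into a discriminator. The field widths (8 base bits, 6 bits per FS pass) are assumptions made up for this sketch, not the real FS-AFDO layout.

```python
# Illustration of the 32-bit budget on discriminators. Field widths
# are assumptions for the sketch, not the actual FS-AFDO encoding:
# the low 8 bits hold the base discriminator, and each FS pass that
# may duplicate code gets a 6-bit slot above that.

BASE_BITS = 8
PASS_BITS = 6                               # assumed slot width per pass
MAX_PASSES = (32 - BASE_BITS) // PASS_BITS  # only 4 such passes fit

def set_pass_bits(disc: int, pass_id: int, value: int) -> int:
    """Write `value` into the slot reserved for pass `pass_id`."""
    assert 0 <= pass_id < MAX_PASSES and 0 <= value < (1 << PASS_BITS)
    shift = BASE_BITS + pass_id * PASS_BITS
    mask = ((1 << PASS_BITS) - 1) << shift
    return ((disc & ~mask) | (value << shift)) & 0xFFFFFFFF
```

Even under these generous assumptions only a handful of passes fit, which is why adding a CFG block ID on top would not leave room for the flow-sensitive bits.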

This would be complementary to the context-sensitive SPGO we proposed earlier; it would be nice to make PGO/FDO both context-sensitive and flow-sensitive. I think the flow-sensitive part could also integrate with pseudo-probes: essentially we can append FS discriminators later, multiple times, the same way as you proposed here, except that the “line” part would be the “probeId”.

Yes, I think this idea could be used in your pseudo-probe work. The good thing about pseudo-probes is that they do not have a real instruction, so the compiler can afford multiple rounds of probe insertion.

Hi Wenlei, thanks for the comments and questions! Please see my comments inline.

Hi Rong,

This is a very interesting proposal. We’ve also observed profile quality degradation from CFG-destructive passes like loop rotate, and I can see how this proposal would help improve the quality of the profile that drives later optimization passes in the pipeline. I have a few questions.

  • How does this affect today’s AutoFDO? Specifically, can users upgrade the compiler with FS-AutoFDO support without refreshing their profile? I think it’s important to make new improvements like this opt-in, so other AutoFDO users can choose if and when to make the switch from AutoFDO to FS-AutoFDO. With the proposed changes to discriminator encoding, it sounds like we are going to eliminate the duplication factor etc. altogether. In that case, multiple FS-AutoFDO sample profile loadings would be required in order not to regress from today’s AutoFDO due to the lack of a duplication factor. Is that correct? If so, this change as is can break backward compatibility with today’s AutoFDO profiles. It’d be great to handle discriminators in a way that is compatible with today’s AutoFDO.

FS-AFDO has a new discriminator encoding scheme. You are right that it’s not backward compatible with current AFDO profiles. The conflict is mainly from the duplication factor, as it will be completely replaced.
The only possible regression I can think of is using a current AFDO profile as an FS-AFDO profile: due to the difference in encoding, some discriminators might be handled incorrectly.
I think we can add a flag to the AFDO header to let the compiler know whether this is the new FS-AFDO profile.

We did have some discussions on the deployment at Google. It will require two stages: the first stage enables the new discriminator scheme but still reads the current AFDO profile. In the second stage, we will use the FS-AFDO profile and enable the FS-AFDO passes.

  • Profile loading usually happens early in the pipeline in order to avoid mismatches between profiling and optimization. In the case of FS-AutoFDO, some mismatch may be inevitable for the later FS profile loading. In practice, how significant is the mismatch in the later profile loading? Have you tried to see by how much using multiple iterations (like CSPGO) can further help profile quality and improve performance? I understand this is not practical for production use, but it could give us data points as to what’s left on the table due to the mismatch of later FS profile loading.

I did measure the coverage (the same mechanism used in SampleLoader). For our non-trivial programs, we can use about 74% of the samples (for the FS-AFDO loader before machine block placement).
I also tried iterative FS-AFDO. Unfortunately, iterative AFDO alone shows a non-trivial regression which is larger than the gain from FS-AFDO. I did that experiment a while ago; I might need to revisit it later.

Was the experiment (iterative FS-AFDO) with one round of FS profile loading (for MBP) or multiple? As with AFDO, differences in transformation decisions (e.g., ifcvt) can affect profile quality.

  • In your performance experiment, where is the extra FSProfileSampleLoader in the pipeline: right before machine block placement, or somewhere else? Have you tried adding more than one FSProfileSampleLoader, and is there an extra perf gain from more than one?

The main performance gain is from the FSProfileSampleLoader before machine block placement. I did add one more loader after block placement (and a new BranchFolder pass to do the tail merging), but that did not give a visible performance gain.
That said, I think there is still a lot of tuning work to do for better performance. I will definitely try more experimental loader passes.

  • For the final discriminator assignment, is this because we take the MAX of samples on addresses from the same location? So if there’s any late code duplication after the last discriminator assignment, we need a final discriminator assignment to turn MAX into SUM for the earlier FS profile?

Yes, that’s the main consideration. create_llvm_profile still uses MAX for the same locations. The final discriminator assignment assigns a different discriminator to each piece of duplicated code so that their samples can be summed.

There is tail duplication happening in MBP, so branches do get cloned there.

Thanks for the reply, see comments inline.