PGO is ineffective for Rust - but why?

This is quite interesting. It suggests that with either a single large compilation unit, or with the one that ThinLTO effectively creates through heavy importing, the inliner over-inlines code that is presumably not very hot, hurting overall performance. For example, since the inliner works bottom-up, inlining cold or lukewarm code might prevent more important inlines further up the call chain, because the function becomes too large. With split compilation units and more conservative importing, it is presumably importing, and therefore inlining, the hotter call edges more effectively. I know David has been looking at this type of situation in the inliner.
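To make the hypothesis above concrete, here is a toy model of a bottom-up inliner with a fixed size threshold. It is only a sketch: the function names (`top`, `mid`, `cold`), the sizes, and the threshold are made up for illustration, and LLVM's real cost model is far more involved (for one, its thresholds depend on call-site hotness).

```rust
use std::collections::HashMap;

/// Toy function: a size, plus the names of the functions it calls.
struct Func {
    size: u32,
    callees: Vec<&'static str>,
}

/// Visit functions in bottom-up order (callees before callers) and inline
/// every callee whose body is visible and whose current size is within
/// `threshold`. Returns the (caller, callee) pairs that were inlined.
fn inline_bottom_up(
    funcs: &mut HashMap<&'static str, Func>,
    order: &[&'static str],
    threshold: u32,
) -> Vec<(&'static str, &'static str)> {
    let mut inlined = Vec::new();
    for &caller in order {
        let callees = match funcs.get(caller) {
            Some(f) => f.callees.clone(),
            None => continue, // body not visible in this compilation unit
        };
        for callee in callees {
            // A callee that is not in the map cannot be inlined.
            if let Some(callee_size) = funcs.get(callee).map(|f| f.size) {
                if callee_size <= threshold {
                    // Inlining grows the caller by the callee's current size.
                    funcs.get_mut(caller).unwrap().size += callee_size;
                    inlined.push((caller, callee));
                }
            }
        }
    }
    inlined
}

/// Hypothetical call chain top -> mid -> cold, where top -> mid is the hot
/// edge. `cold_visible` models whether the body of `cold` is available,
/// e.g. through a single large compilation unit or aggressive importing.
fn scenario(cold_visible: bool) -> Vec<(&'static str, &'static str)> {
    let mut funcs = HashMap::new();
    funcs.insert("top", Func { size: 10, callees: vec!["mid"] });
    funcs.insert("mid", Func { size: 20, callees: vec!["cold"] });
    if cold_visible {
        funcs.insert("cold", Func { size: 25, callees: vec![] });
    }
    inline_bottom_up(&mut funcs, &["cold", "mid", "top"], 30)
}
```

In this model, `scenario(true)` inlines `cold` into `mid`, which pushes `mid` past the threshold so the hot `top -> mid` edge is never inlined. `scenario(false)` leaves `mid` small enough to be inlined into `top`.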

Yes, the data does seem to suggest that greater availability of code can actually be detrimental to inlining effectiveness. It's especially interesting that this also seems to hold when profiling data is available, where one might expect the additional information to let the inliner be at least as smart. However, these numbers come from a single codebase and its set of microbenchmarks, so they should be taken with a grain of salt. It might be an interesting lead, though.

Interesting. Does PGO mean PGO + ThinLTO here?

For the cases with more than one compilation unit per crate: yes, but in a slightly restricted sense. By default, the Rust compiler performs ThinLTO among all compilation units that comprise a single crate. So ThinLTO is performed on every compilation unit of the program, but rather than running over the entire set at once, it runs separately on N non-overlapping subsets (where N is the number of crates that make up the program).
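As a rough illustration of that partitioning (not the compiler's actual data structures), the sketch below groups compilation units by the crate they belong to. Each resulting group stands for one independent ThinLTO run. The crate and unit names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Group (crate, compilation unit) pairs into per-crate sets. Each set models
/// one crate-local ThinLTO run; importing never crosses from one set to another.
fn thin_lto_sets(
    units: &[(&'static str, &'static str)],
) -> BTreeMap<&'static str, Vec<&'static str>> {
    let mut sets: BTreeMap<&'static str, Vec<&'static str>> = BTreeMap::new();
    for &(krate, unit) in units {
        sets.entry(krate).or_default().push(unit);
    }
    sets
}

/// Hypothetical program made of two crates with three compilation units in total.
fn example_sets() -> BTreeMap<&'static str, Vec<&'static str>> {
    thin_lto_sets(&[
        ("app", "app.cgu0"),
        ("app", "app.cgu1"),
        ("dep", "dep.cgu0"),
    ])
}
```

Here `example_sets()` yields two sets, so N = 2: `app.cgu0` and `app.cgu1` can import from each other, but neither can import from `dep.cgu0`.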

The compiler treats PGO orthogonally, so in those cases where ThinLTO is the default, PGO is combined with ThinLTO.