A note on testing: we were able to build Clang with a Release Full LTO configuration with PGOBBAddrMap being emitted (no PGO profile was used, though, since we don’t have one). We confirmed that it generated the section properly and that it could be consumed by the llvm-readobj and llvm-objdump tools.
I also tried it on a large binary with instrumented FDO and ThinLTO, and I can confirm the data comes out fine at the end.
I did not check the values beyond them looking sane (i.e. there are lots of non-zero values), but that’s something for PR review; my main goal was validating that we can build a server-side, statically linked (== big) binary with this.
Some anecdotal data (in both cases, entry counts are enabled):
- both bb_freq and br_prob → ~20% size increase
- just bb_freq → ~10% size increase
Looking forward to the PRs!
@red1bluelost it may be worth mentioning your finding about the compile-time impact when profiles are missing? I don’t think it’s pertinent to this scenario or this design, but the finding could be interesting to others.
For patch 2 (the AsmPrinter change where the data is emitted into the ELF), there does seem to be a decent cost to calling MBFI from AsmPrinter, especially with LTO enabled. From running perf on my clang Full LTO build, the cost was split between computing MBFI and running the Loop analysis objects’ destructors.
If you enable bb_freq you will have to pay this cost. For now br_prob pays it too, but I should be able to make it so that br_prob does not use MBFI.
Having an actual PGO profile seems to reduce this cost notably (my guess is that MBFI gets created earlier in the pipeline). I’d be curious if anyone has other theories or knowledge about why a PGO profile would reduce the runtime for calculating MBFI.
MBFI (or LazyMBFI) typically has to be recomputed multiple times; it basically gets invalidated every time the CFG changes. Roughly, any pass that does not call setPreservesAll() will invalidate MBFI, so I wouldn’t be surprised if we need one more round of computation.
With that said, I wouldn’t expect this to be too costly. I would think we already recompute MBFI multiple times earlier in the pipeline, and I wouldn’t expect one more round to add a noticeable cost. What do you mean when you say “decent cost”?
In any case, I don’t think we should worry too much about this: it’s an optional feature that the majority of users will not enable, and there’s no extra cost if it’s disabled, right? So it’s probably good to do some measurements and add the numbers to the patch description, but if it’s not a big percentage of total compilation time, I don’t think we need to do anything about it.
When @Mircea was testing it out, bb-freq without a PGO profile would exceed his work’s build timeout, but using a PGO profile kept it within the limit. The extra time spent compiling is the “decent cost”.
At my work, we don’t have any timeouts, but when I ran perf on a clang Full LTO build (linked with LLD) with all pgo-bb-addr-map features enabled, the profile showed that ~98% of the time spent generating the map was inside LazyMBFI’s calculateIfNeeded. In the overall profile, emitBBAddrMap accounted for 0.62% of total time, with calculateIfNeeded at 0.61% (i.e. nearly all of emitBBAddrMap).
Either way, the time cost shouldn’t be an issue for anyone not interested in the new feature, since it is completely optional for general users. Existing BBAddrMap users also should not notice much difference, since the same API is preserved and the PGO features won’t run. I’ll see if I can time some of the code changes compared to regular bb-addr-map.