BOLT: Accuracy of instrumentation mode?

The documentation for -instrument mode in BOLT presents it as a second choice if you don’t have perf available or can’t use it for some reason.

However, what are the downsides of using instrument mode?

The obvious one is that you need to instrument the binary, which makes the process slower, unsuitable for production, etc. However, is the collected profile information also worse in some sense? Intuitively, since BOLT seems to collect “exact” information about the dynamic instruction flow (branches taken, etc.) in a hardware-independent way, it seems the quality of the profile might even be better, since it is, e.g., not subject to sampling error.
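To make the sampling-error intuition concrete, here is a toy model in Python. It is not BOLT code and does not reflect how perf or LBR sampling actually works. The function name `simulate`, its parameters, and the "see one execution out of every `sample_period`" rule are all assumptions chosen for illustration. It compares the taken ratio of a single branch when every execution is counted (instrumentation-style) with the ratio estimated from a periodic subset (sampling-style).

```python
import random


def simulate(true_taken_prob, executions, sample_period, seed=0):
    """Return (exact_ratio, sampled_ratio) for one hypothetical branch.

    exact_ratio   : taken ratio when every execution is counted.
    sampled_ratio : taken ratio when only every `sample_period`-th
                    execution is observed (toy stand-in for sampling).
    """
    rng = random.Random(seed)
    exact_taken = 0
    sampled_taken = 0
    sampled_total = 0
    for i in range(executions):
        taken = rng.random() < true_taken_prob
        # Instrumentation-style counting: every execution is recorded.
        exact_taken += taken
        # Sampling-style counting: only a periodic subset is recorded.
        if i % sample_period == 0:
            sampled_total += 1
            sampled_taken += taken
    return exact_taken / executions, sampled_taken / sampled_total


if __name__ == "__main__":
    # A hot branch gets many samples, so the estimate is usually close.
    print("hot branch :", simulate(0.3, 1_000_000, 1000))
    # A cold branch gets a single sample, so the estimate is all-or-nothing.
    print("cold branch:", simulate(0.3, 50, 1000))
```

In this model the cold branch is observed only once, so its sampled ratio can only be 0.0 or 1.0, while the exact count still reflects all 50 executions. That is the kind of error exact counting avoids, under the assumptions stated above.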

I ask because I’m using the instrument mode and wonder what I’m missing. This mode is very convenient to use e.g., in CI where one cannot generally expect hardware performance counters to be available.

You got all the pros and cons of instrumentation right. The quality is the best with instrumentation, assuming you are able to collect the profile under the same conditions. E.g., since the binary is slower, it could run differently (compared to the non-instrumented one) in production if the load balancer is not sending it the same traffic.

Another downside of --instrument is that we need to implement BOLT instrumentation and runtime support for different architectures. But that’s a minor inconvenience if you consider that LBRs (last branch records) are missing on many platforms and require way more effort to be added to hardware and the OS.

Thanks for your answer.

Good point about the possibility of the instrumented binary getting less or different load in your load balancer example. I guess it could even apply within a single process, e.g., where a program runs a kind of self-benchmark to select an algorithm (the results might be changed by BOLT), or where timing changes threading behavior: the profile may vary depending on whether an inter-thread queue is nearly always empty or nearly always full, and this might change because BOLT may alter (for example) the relative consumer and producer performance.

In my case, however, which is compiling a BOLT-instrumented version of clang, I guess this does not apply.