Various forms of measurement bias exist when measuring the effect of a performance optimization. Some of these are highlighted in the paper, Producing Wrong Data Without Doing Anything Obviously Wrong! This RFC presents a technique specifically for controlling the measurement bias referred to as “link order” in the paper. At Google we found that changes to code generation can impact performance results in unexpected ways, especially when the magnitude of the true performance difference is small. The hypothesis is that code size dependent measurement bias occurs as a result of moving code into locations that happen to be more or less favorable to CPU caches and/or the TLB.
The basic idea for controlling for this measurement bias is to use a linker feature for constructing “counterfactuals”, i.e. binaries that are still realistic but have been modified to impact cache favorability. Using these binaries, we can perform several A/B experiments, and by averaging the results of these experiments we can measure the performance effect of a change given an average cache favorability level. In this way, we can know what the average effect of an optimization or other code change would be, and not just what its effect happens to be on two specific binaries used for benchmarking, which would have essentially random cache favorability levels.
Previous work in this area includes the –shuffle-sections flag to LLD, which causes the section order to be shuffled according to the result of a random number generator. However, –shuffle-sections has the following downsides:
- It randomizes section layout, which is not a realistic counterfactual because functions in the same TU are normally clustered together, and this by itself can cause performance regressions due to decreased locality that may lead to measurement bias.
- It prevents the use of the feature together with section layout optimizing features such as Propeller, call-graph sorting and –symbol-ordering-file. So for example if you are trying to measure the performance impact of a change to Propeller, you wouldn’t be able to use --shuffle-sections to control for measurement bias while doing so.
This proposal is to add a –shuffle-padding=SEED flag to LLD. This flag is intended to produce a counterfactual in which the sizes of various functions are slightly larger. We can consider this to be a more realistic counterfactual, because it can be thought of as representing a future version of the program in which changes to the code have increased the sizes of various functions.
Specifically, the flag would do two things:
- Randomly introduce small amounts of padding between sections according to an RNG seeded with SEED. Specifically, it would work as follows: for each input section in an output section, read one byte of random data from the RNG. If the byte is less than some constant representing padding probability, insert sh_addralign bytes of padding before the input section. By default, the flag would only have an effect on output sections that cannot have start/stop symbols, such as .text, .rodata, .data.rel.ro and so on.
- Insert a random amount of padding (between 0 and max-page-size) at the start of each segment (PT_LOAD) using the same RNG.
To control for the measurement bias, a user would need to build their program with -ffunction-sections and set up N A/B experiments, each passing a different seed at link time. The same seed would be passed in the base and experiment case. Each of the A/B experiments will involve M trials. This will give us M*N percentage differences, and we can average the percentage differences and perform a T test to compute the confidence interval.
It should be noted that this is by no means a perfect control for the measurement bias because the insertion of padding would itself introduce a measurement bias, but the magnitude of the bias is expected to be less than with –shuffle-sections, especially with a low padding probability.
For the time being, the probability of inserting padding is hardcoded to 1 in 16. However, we may consider making this configurable.