(RFC) Adjusting default loop fully unroll threshold

Currently, loop fully unroller shares the same default threshold as loop dynamic unroller and partial unroller. This seems conservative because unlike dynamic/partial unrolling, fully unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for loop fully unroller. This will change the codegen of several SPECCPU benchmarks:

Code size:
447.dealII 0.50%
453.povray 0.42%
433.milc 0.20%
445.gobmk 0.32%
403.gcc 0.05%
464.h264ref 3.62%

Compile Time:
447.dealII 0.22%
453.povray -0.16%
433.milc 0.09%
445.gobmk -2.43%
403.gcc 0.06%
464.h264ref 3.21%

Performance (on intel sandybridge):
447.dealII +0.07%
453.povray +1.79%
433.milc +1.02%
445.gobmk +0.56%
403.gcc -0.16%
464.h264ref -0.41%

Looks like the change has overall positive performance impact with very small code size/compile time overhead. Now the question is shall we make this change default in O2, or shall we leave it in O3. We would like to have more input from the community to make the decision.

Thanks,

Dehao

Intuitively (correct me if I am wrong) I would think loop unrolling is a more risky operation that is good in many/most cases but can be detrimental to performance (by blowing up code sizes and I-Caches). So I would rather put that into -O3.

  • Matthias

Currently, loop fully unroller shares the same default threshold as loop
dynamic unroller and partial unroller. This seems conservative because
unlike dynamic/partial unrolling, fully unrolling will not affect
LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to
double the threshold for loop fully unroller. This will change the codegen
of several SPECCPU benchmarks:

Code size:
447.dealII 0.50%
453.povray 0.42%
433.milc 0.20%
445.gobmk 0.32%
403.gcc 0.05%
464.h264ref 3.62%

Compile Time:
447.dealII 0.22%
453.povray -0.16%
433.milc 0.09%
445.gobmk -2.43%
403.gcc 0.06%
464.h264ref 3.21%

Performance (on intel sandybridge):
447.dealII +0.07%
453.povray +1.79%
433.milc +1.02%
445.gobmk +0.56%
403.gcc -0.16%
464.h264ref -0.41%

Looks like the change has overall positive performance impact with very
small code size/compile time overhead. Now the question is shall we make
this change default in O2, or shall we leave it in O3. We would like to
have more input from the community to make the decision.

Intuitively (correct me if I am wrong) I would think loop unrolling is a
more risky operation that is good in many/most cases but can be detrimental
to performance (by blowing up code sizes and I-Caches). So I would rather
put that into -O3.

Yes, it applies to runtime/partial loop unroll. That's why we want to set a
conservative threshold to make sure unrolled loop does not blow up loop
stream detector (LSD) and L1I. But for fully unroll, it only handles cases
where loop trip count is small, so the LSD will never activate. By
completely flattening the loop, the loop became straight-line code thus we
do not have L1I issue for the flattened loop. That's why we think the
threshold should be more aggressive for fully unroller.

Dehao

Can you clarify how to read these numbers? (I’m using +xx% to indicates a slowdown usually, it seems you’re doing the opposite?).

So considering 464.h264ref, does it mean it is 3.6% slower to compile, gets 3.2% larger, and 0.4% slower?

Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path).

Thanks,

Currently, loop fully unroller shares the same default threshold as loop dynamic unroller and partial unroller. This seems conservative because unlike dynamic/partial unrolling, fully unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for loop fully unroller. This will change the codegen of several SPECCPU benchmarks:

Code size:
447.dealII 0.50%
453.povray 0.42%
433.milc 0.20%
445.gobmk 0.32%
403.gcc 0.05%
464.h264ref 3.62%

Compile Time:
447.dealII 0.22%
453.povray -0.16%
433.milc 0.09%
445.gobmk -2.43%
403.gcc 0.06%
464.h264ref 3.21%

Performance (on intel sandybridge):
447.dealII +0.07%
453.povray +1.79%
433.milc +1.02%
445.gobmk +0.56%
403.gcc -0.16%
464.h264ref -0.41%

Can you clarify how to read these numbers? (I’m using +xx% to indicates a slowdown usually, it seems you’re doing the opposite?).

So considering 464.h264ref, does it mean it is 3.6% slower to compile, gets 3.2% larger, and 0.4% slower?

Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path).

I would even wire the unrolling the other way: just suppress unrolling in cold paths to save binary size. rolled loops seem like a generally good thing in cold code unless they are having some larger impact (IE, the loop itself is more expensive than the unrolled form).

Currently, loop fully unroller shares the same default threshold as loop
dynamic unroller and partial unroller. This seems conservative because
unlike dynamic/partial unrolling, fully unrolling will not affect
LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed
to double the threshold for loop fully unroller. This will change the
codegen of several SPECCPU benchmarks:

Code size:
447.dealII 0.50%
453.povray 0.42%
433.milc 0.20%
445.gobmk 0.32%
403.gcc 0.05%
464.h264ref 3.62%

Compile Time:
447.dealII 0.22%
453.povray -0.16%
433.milc 0.09%
445.gobmk -2.43%
403.gcc 0.06%
464.h264ref 3.21%

Performance (on intel sandybridge):
447.dealII +0.07%
453.povray +1.79%
433.milc +1.02%
445.gobmk +0.56%
403.gcc -0.16%
464.h264ref -0.41%

Can you clarify how to read these numbers? (I’m using +xx% to indicates a
slowdown usually, it seems you’re doing the opposite?).

As this is comparing spec scores instead of run time, +xx% here means
speedup, -xx% means slowdown.

So considering 464.h264ref, does it mean it is 3.6% slower to compile,
gets 3.2% larger, and 0.4% slower?

That is correct. The 0.4% slowdown is in the run-to-run noise range.

Another question is about PGO integration: is it already hooked there?
Should we have a more aggressive threshold in a hot function? (Assuming
we’re willing to spend some binary size there but not on the cold path).

I would even wire the *unrolling* the other way: just suppress unrolling
in cold paths to save binary size. rolled loops seem like a generally good
thing in cold code unless they are having some larger impact (IE, the loop
itself is more expensive than the unrolled form).

Agree that we could suppress unrolling in cold path to save code size. But
that's orthogonal with the propose here. This proposal focuses on O2
performance: shall we have different (higher) fully unroll threshold than
dynamic/partial unroll.

We can have a separate patch to further boost threshold for hot loops and
suppress unrolling for cold loops. One concern is that in order to check if
a loop is hot/cold, we will need BFI for the loop pass. In the legacy loop
pass manager, this will insert a function pass in the middle of a series of
loop passes.

Dehao

Currently, loop fully unroller shares the same default threshold as loop dynamic unroller and partial unroller. This seems conservative because unlike dynamic/partial unrolling, fully unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for loop fully unroller. This will change the codegen of several SPECCPU benchmarks:

Code size:
447.dealII 0.50%
453.povray 0.42%
433.milc 0.20%
445.gobmk 0.32%
403.gcc 0.05%
464.h264ref 3.62%

Compile Time:
447.dealII 0.22%
453.povray -0.16%
433.milc 0.09%
445.gobmk -2.43%
403.gcc 0.06%
464.h264ref 3.21%

Performance (on intel sandybridge):
447.dealII +0.07%
453.povray +1.79%
433.milc +1.02%
445.gobmk +0.56%
403.gcc -0.16%
464.h264ref -0.41%

Can you clarify how to read these numbers? (I’m using +xx% to indicates a slowdown usually, it seems you’re doing the opposite?).

As this is comparing spec scores instead of run time, +xx% here means speedup, -xx% means slowdown.

So considering 464.h264ref, does it mean it is 3.6% slower to compile, gets 3.2% larger, and 0.4% slower?

That is correct. The 0.4% slowdown is in the run-to-run noise range.

Ok, thanks for the clarifications.
What about the noise on the improvements? How reliable are you other numbers on this aspect?

Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path).

I would even wire the unrolling the other way: just suppress unrolling in cold paths to save binary size. rolled loops seem like a generally good thing in cold code unless they are having some larger impact (IE, the loop itself is more expensive than the unrolled form).

Agree that we could suppress unrolling in cold path to save code size. But that’s orthogonal with the propose here. This proposal focuses on O2 performance: shall we have different (higher) fully unroll threshold than dynamic/partial unroll.

I agree that this is (to some extent) orthogonal, and it makes sense to me to differentiate the threshold for full unroll and the dynamic/partial case.

Thanks,

The general direction here seems fine (we should re-evaluate these kinds of thresholds from time to time).

The other high level question I have here is: why 2x? Why not 4x?

I think it would be good to show some data from a reasonable spread and where some reasonable point is in the curve to draw the line.

I’m also mildly worried about basing something like this on SPEC which involves relatively small programs. It might be good to ask others to provide data as well on code size / performance tradeoffs for larger applications. (Chrome for example.)

And it might be worth writing down the methodology used to set the thresholds so we start gaining some consistency on this front.

Another question is about PGO integration: is it already hooked there? Should we have a more aggressive threshold in a hot function? (Assuming we’re willing to spend some binary size there but not on the cold path).

I would even wire the unrolling the other way: just suppress unrolling in cold paths to save binary size. rolled loops seem like a generally good thing in cold code unless they are having some larger impact (IE, the loop itself is more expensive than the unrolled form).

Agree that we could suppress unrolling in cold path to save code size. But that’s orthogonal with the propose here. This proposal focuses on O2 performance: shall we have different (higher) fully unroll threshold than dynamic/partial unroll.

I agree that this is (to some extent) orthogonal, and it makes sense to me to differentiate the threshold for full unroll and the dynamic/partial case.

There is one issue that makes these not orthogonal.

If even static profile hints will reduce some of the code size increase caused by higher unrolling thresholds for non-cold code, we should factor that into the tradeoff in picking where the threshold goes.

However, getting PGO into the full unroller is currently challenging outside of the new pass manager. We already have some unfortunate hacks around this in LoopUnswitch that are making the port of it to the new PM more annoying.

Recollected the data from trunk head with stddev data and more threshold data points attached:

Performance:


stddev/mean 300 450 600 750
403 0.37% 0.11% 0.11% 0.09% 0.79%
433 0.14% 0.51% 0.25% -0.63% -0.29%
445 0.08% 0.48% 0.89% 0.12% 0.83%
447 0.16% 3.50% 2.69% 3.66% 3.59%
453 0.11% 1.49% 0.45% -0.07% 0.78%
464 0.17% 0.75% 1.80% 1.86% 1.54%

Code size:


300 450 600 750
403 0.56% 2.41% 2.74% 3.75%
433 0.96% 2.84% 4.19% 4.87%
445 2.16% 3.62% 4.48% 5.88%
447 2.96% 5.09% 6.74% 8.89%
453 0.94% 1.67% 2.73% 2.96%
464 8.02% 13.50% 20.51% 26.59%

Compile time is proportional in the experiments and more noisy, so I did not include it.

We have >2% speedup on some google internal benchmarks when switching the threshold from 150 to 300.

Dehao

With the new data points, any comments on whether this can justify setting fully inline threshold to 300 (or any other number) in O2? I can collect more data points if it’s helpful.

Thanks,
Dehao

I had suggested having size metrics from somewhat larger applications such as Chrome, Webkit, or Firefox; clang itself; and maybe some of our internal binaries with rough size brackets?

clang, chrome, and some internal large apps are good candidates for size metrics.

David

clang, chrome, and some internal large apps are good candidates for size metrics.

I’d also add the standard LLVM testsuite just because it’s the suite everyone in the community can use.

Michael

Here is the code size impact for clang, chrome and 24 google internal benchmarks (name omited, 14 15 16 are encoding/decoding benchmarks similar as h264). There are 2 columns, for threshold 300 and 450 respectively.

I also tested the llvm test suite. Changing the threshold to 300/450 does not affect code gen for any binary in the test suite.

Ping… with the updated code size impact data, any more comments? Any more data that would be interesting to collect?

Thanks,
Dehao

Sorry if I missed it, but what machine/CPU are you using to collect the perf numbers?

I am concerned that what may be a win on a CPU that keeps a couple of hundred instructions in-flight and has many MB of caches will not hold for a small core.

Is the proposed change universal? Is there a way to undo it?

In my experience, unrolling tends to help weaker cores even more than stronger ones because it allows the instruction scheduler more opportunities to hide latency. Obviously, instruction-cache pressure is an important consideration, but the code size changes here seems small. All of the unrolling thresholds should be target-adjustable using the TTI::getUnrollingPreferences hook. -Hal

Thanks every for the comments.

Do we have a decision here?

Dehao

You’re good to go as far as I’m concerned. -Hal