[Shrink-Wrapping] Request For Benchmarking: X86 and AArch64

Hi,

Shrink-wrapping capabilities, i.e., better placement of prologue and epilogue sequences, landed in r236507 but are not yet enabled by default.
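For readers unfamiliar with the optimization, here is a minimal sketch of the kind of function that can benefit. The names `process` and `sum_processed` are hypothetical and do not come from the LLVM sources. The function has a cheap early exit and a slow path whose values stay live across calls. Without shrink-wrapping, callee-saved registers are typically saved at function entry, so the early exit pays for them too. With shrink-wrapping, the save/restore code can be placed around the slow path only.

```cpp
// Hypothetical callee. In a real experiment it would live in another
// translation unit so that it is not inlined into the loop below.
int process(int value) { return value * 2 + 1; }

int sum_processed(const int *data, int n) {
  // Fast path: nothing here needs a callee-saved register, so ideally
  // no prologue/epilogue code should run before this return.
  if (data == nullptr || n <= 0)
    return 0;

  // Slow path: data, n, i and total are live across the calls, so the
  // compiler will usually keep them in callee-saved registers that must
  // be saved and restored somewhere in this function.
  int total = 0;
  for (int i = 0; i < n; ++i)
    total += process(data[i]);
  return total;
}
```

Comparing the assembly from `clang++ -O3 -S` with and without `-mllvm -enable-shrink-wrap` should show whether the saves moved past the early return. Whether they actually move depends on the target and on what the code generator decides for this particular function.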

Since r236507 AArch64 is shrink-wrapping ready, meaning we can turn the pass on for this target.
I’ve done the same for X86 in r238293.

Now, I need your help to test and benchmark how shrink-wrapping performs on those targets.

The goal is to decide whether or not the support is good enough to be enabled by default.

** How Can I Test/Benchmark It? **

Add (-mllvm) -enable-shrink-wrap on your command line or patch the XXXConfigPass to set EnableShrinkWrap to true.
Note that -enable-shrink-wrap=<bool> takes precedence over whatever is set for EnableShrinkWrap.
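
The precedence rule can be modeled as a tri-state flag: unset, true, or false. The sketch below is a hypothetical model of that behavior, not the code in the LLVM tree. The function name `shrinkWrapEnabled` and its parameters are made up for illustration.

```cpp
#include <optional>

// Hypothetical model: commandLineFlag is empty when -enable-shrink-wrap
// was not passed, and targetDefault stands in for the target's
// EnableShrinkWrap setting.
bool shrinkWrapEnabled(std::optional<bool> commandLineFlag,
                       bool targetDefault) {
  // An explicit flag always wins; otherwise fall back to the target.
  return commandLineFlag.value_or(targetDefault);
}
```

Under this model, passing -enable-shrink-wrap=false disables the pass even on a target that sets EnableShrinkWrap to true, which is handy for A/B comparisons.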

Please report any problems specific to having this optimization turned on. A PR with a small IR reproducer is appreciated.

Note: I’ve seen up to 4% runtime improvements on the LLVM test-suite + specs for Os and O3.

Thanks in advance for your help,
-Quentin

I’ve run it across a wide variety of server benchmarks we care about. Looks like all the changes are in the noise across Sandy Bridge and Ivy Bridge architectures.

No interesting performance changes (in either direction sadly).

I saw some very minor size fluctuations and dug into it. Turns out there was a missed easy size optimization in it that Quentin has already implemented based on our conversation on IRC.

As far as I can see, this is pure goodness. Let’s turn it on.

Hi Quentin,

This is interesting, I was meaning to look at that at some point. Glad
you did it. :)

I just did a quick test-suite run on AArch64 and I'm getting 1% worse
overall in the benchmark set. There were cases over 50% worse
(TSVC/Equivalencing-dbl, TSVC/Symbolics-dbl,
Polybench/medley/reg_detect) and the best case was only 30% better
(ASC_Sequoia/IRSmk).

Geomean difference of compile time is within noise range.
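
For anyone reproducing the comparison, "geomean" is presumably the geometric mean of per-benchmark ratios, which is a common way to summarize test-suite runs. A minimal sketch, assuming each ratio is the time with the pass enabled divided by the baseline time (the function name `geomeanRatio` is hypothetical):

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark ratios (new / baseline).
// Assumes the vector is non-empty and every ratio is > 0.
double geomeanRatio(const std::vector<double> &ratios) {
  double logSum = 0.0;
  for (double r : ratios)
    logSum += std::log(r); // summing logs avoids overflow of the product
  return std::exp(logSum / static_cast<double>(ratios.size()));
}
```

A result of 1.0 means no overall change. For example, one benchmark at 2.0 and another at 0.5 cancel out exactly, which is why a few large swings can coexist with an overall difference that stays within noise.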

On an A57 device, for SPEC2000 with -O3, I'm seeing no significant performance changes across the board. With that being said, I think this is pretty cool stuff. Thanks for working on this, Quentin.

Chad

Hi Renato,

> Hi Quentin,
>
> This is interesting, I was meaning to look at that at some point. Glad
> you did it. :)

You’re welcome :).

>> Note: I’ve seen up to 4% runtime improvements on the LLVM test-suite + specs for Os and O3.
>
> Which target?

That was AArch64.

Cheers,
Q.

>> Note: I’ve seen up to 4% runtime improvements on the LLVM test-suite + specs for Os and O3.
>
> I just did a quick test-suite run on AArch64 and I'm getting 1% worse
> overall in the benchmark set. There were cases over 50% worse
> (TSVC/Equivalencing-dbl, TSVC/Symbolics-dbl,
> Polybench/medley/reg_detect) and the best case was only 30% better
> (ASC_Sequoia/IRSmk).

Strange.

I haven’t seen swings that big, and the overall performance difference was in the noise, but better.
Would you mind looking more closely at the diff and either filing a PR or giving me the command line to reproduce?

Thanks,
-Quentin

Hi Quentin,

I'm still trying to make sense of the results, and I believe that has
to do with not using perf to get the timings, as the noise increases
considerably. Unfortunately, my kernel is a bit on the old side and
manually compiled a long time ago. I wouldn't hold your commit because
of that, as it was within expected ranges anyway.

If I can spot anything bad later, I'll file a new bug and we can look
at it. Shrink wrapping is an optimisation I wanted, so let's just get
that in. :)

cheers,
--renato