Enable vectorizer-maximize-bandwidth by default?

Hi,

I'm proposing to turn vectorizer-maximize-bandwidth on by default for the loop vectorizer, because it should generally help performance.

I've tested the performance impact on an Intel Sandy Bridge machine with the SPEC CPU benchmarks:

Benchmark Base:Reference (1)

Besides SPEC CPU, do any other real-world applications benefit from this option?

Regards,
chenwj

This sounds good to me. Enabling this by default has been mentioned a few times already. I’ve tested this feature in the past on AArch64 (Kryo and Falkor) and found it to be beneficial for mixed-type loops. Thanks!

Yes, we do see performance benefits from this change on some Google-internal benchmarks.

Dehao

Hi,

I'm proposing to turn vectorizer-maximize-bandwidth on by default for the loop vectorizer, because it should generally help performance.

I've tested the performance impact on an Intel Sandy Bridge machine with the SPEC CPU benchmarks:

Benchmark Base:Reference (1)

spec/2006/fp/C++/444.namd 26.84 -0.31%
spec/2006/fp/C++/447.dealII 46.19 +0.89%
spec/2006/fp/C++/450.soplex 42.92 -0.44%
spec/2006/fp/C++/453.povray 38.57 -2.25%
spec/2006/fp/C/433.milc 24.54 -0.76%
spec/2006/fp/C/470.lbm 41.08 +0.26%
spec/2006/fp/C/482.sphinx3 47.58 -0.99%
spec/2006/int/C++/471.omnetpp 22.06 +1.87%
spec/2006/int/C++/473.astar 22.65 -0.12%
spec/2006/int/C++/483.xalancbmk 33.69 +4.97%
spec/2006/int/C/400.perlbench 33.43 +1.70%
spec/2006/int/C/401.bzip2 23.02 -0.19%
spec/2006/int/C/403.gcc 32.57 -0.43%
spec/2006/int/C/429.mcf 40.35 +0.27%
spec/2006/int/C/445.gobmk 26.96 +0.06%
spec/2006/int/C/456.hmmer 24.4 +0.19%
spec/2006/int/C/458.sjeng 27.91 -0.08%
spec/2006/int/C/462.libquantum 57.47 -0.20%
spec/2006/int/C/464.h264ref 46.52 +1.35%

geometric mean +0.29%

Scores are benchmark specific.

We do see a regression on 453.povray, but it's due to secondary effects, as the hot functions are identical in both builds. I've also tested the code size impact; it does not change for the tested SPEC CPU benchmarks.

Can you please describe the config for the runs (optimization level, PGO/no-PGO, etc.)?

It would be good to provide analysis for the changes >1%; i.e., we need to make sure that the improvements are not noise either ;).

I’ve prepared https://reviews.llvm.org/D33341 to do this.

I would really appreciate it if the community could help test the performance impact of this change on other architectures, so that we can decide whether this should be target-dependent.
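
For anyone who wants to try it locally, here is a rough sketch of how the flag can be enabled (the source/IR file names are just placeholders):

    # enable it for a whole compilation via the clang driver
    clang -O2 -mllvm -vectorizer-maximize-bandwidth foo.c -c

    # or exercise only the loop vectorizer on IR with opt
    opt -loop-vectorize -vectorizer-maximize-bandwidth -S foo.ll -o foo.vec.ll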

I will run it on Cyclone/AArch64 next week.

Adam

Hi,

I'm proposing to turn vectorizer-maximize-bandwidth on by default for the
loop vectorizer, because it should generally help performance.

I've tested the performance impact on an Intel Sandy Bridge machine with
the SPEC CPU benchmarks:

           Benchmark Base:Reference (1)
-------------------------------------------------------
spec/2006/fp/C++/444.namd 26.84 -0.31%
spec/2006/fp/C++/447.dealII 46.19 +0.89%
spec/2006/fp/C++/450.soplex 42.92 -0.44%
spec/2006/fp/C++/453.povray 38.57 -2.25%
spec/2006/fp/C/433.milc 24.54 -0.76%
spec/2006/fp/C/470.lbm 41.08 +0.26%
spec/2006/fp/C/482.sphinx3 47.58 -0.99%
spec/2006/int/C++/471.omnetpp 22.06 +1.87%
spec/2006/int/C++/473.astar 22.65 -0.12%
spec/2006/int/C++/483.xalancbmk 33.69 +4.97%
spec/2006/int/C/400.perlbench 33.43 +1.70%
spec/2006/int/C/401.bzip2 23.02 -0.19%
spec/2006/int/C/403.gcc 32.57 -0.43%
spec/2006/int/C/429.mcf 40.35 +0.27%
spec/2006/int/C/445.gobmk 26.96 +0.06%
spec/2006/int/C/456.hmmer 24.4 +0.19%
spec/2006/int/C/458.sjeng 27.91 -0.08%
spec/2006/int/C/462.libquantum 57.47 -0.20%
spec/2006/int/C/464.h264ref 46.52 +1.35%

geometric mean +0.29%

  Scores are benchmark specific.

We do see a regression on 453.povray, but it's due to secondary effects, as
the hot functions are identical in both builds. I've also tested the code size
impact; it does not change for the tested SPEC CPU benchmarks.

Can you please describe the config for the runs (optimization level,
PGO/no-PGO, etc.)?

This is an O2 build without PGO.

It would be good to provide analysis for the changes >1%; i.e., we need to
make sure that the improvements are not noise either ;).

Good point. I just examined all benchmarks with >1% "improvement". It turns
out they are all noise: the hot functions (with >1% of total cycles) are all
identical. So the conclusion is: this change does not affect SPEC CPU2006
performance.

Thanks,
Dehao

Thanks all for the comments. Any other thoughts on how we should proceed with this?

Thanks,
Dehao

FYI, we're still waiting on these, Adam…

Thank you for running these.

May I suggest testing on AVX2-capable hardware? That would be Intel Haswell, AMD Carrizo, and up.
I'm not sure what "vectorizer-maximize-bandwidth" implies, but doubling the vector lanes may help light up parallel regions.
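
If it does what the name suggests (choosing the vectorization factor from the narrowest type in the loop rather than the widest), then a mixed-type loop like the purely illustrative sketch below is where the wider factors would pay off:

    /* Purely illustrative example, not taken from any of the benchmarks above:
       8-bit and 32-bit elements in the same loop. With a VF chosen from the
       widest type (int), the byte operations only fill part of a vector
       register; choosing the VF from the narrowest type lets them use the
       full width. */
    void scale_and_accumulate(unsigned char *dst, const unsigned char *src,
                              unsigned int *acc, int n) {
      for (int i = 0; i < n; i++) {
        dst[i] = (unsigned char)(src[i] * 2); /* 8-bit element */
        acc[i] += src[i];                     /* 32-bit element */
      }
    }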

Kevin

If you care about such hardware, please run benchmarks with the flag.

Dehao has made this flag available. It is important that those who care about particular hardware provide benchmark results. Not everyone in the community will have access to particular hardware variants.

We’re seeing nice improvements but also significant degradations on IA, which we would like to investigate before the patch is committed.

Major degradations we see:

networking
  ip_pktcheckb1m      -6.80%
  ip_pktcheckb2m      -6.74%
  ip_pktcheckb4m      -7.57%
  ip_pktcheckb512k    -6.58%

telecom
  autcor00data_1     -78.02%
  autcor00data_2     -76.80%
  autcor00data_3     -77.00%

(on Atom)

We are still working on creating reproducers.

In general we support this patch; we just want to have a chance to investigate the issues first. We need a few days for that.

BTW, we also tested on AVX2, and saw a few smaller degradations there:

denbench
  cjpegv2data1    -5.02%
  cjpegv2data2    -4.29%
  cjpegv2data3    -4.89%
  cjpegv2data4    -5.44%
  cjpegv2data5    -4.80%
  cjpegv2data6    -3.98%
  cjpegv2data7    -5.48%

coremark-pro
  cjpeg-rose7-preset    -4.91%
  core                  -6.04%

telecom
  autcor00data_1    -2.00%

I was going to test SPEC, but according to Dehao this does not seem to trigger on SPEC, so there is really no reason for me to test it. We have some SPEC perf bots that test trunk; if there is some unexpected regression, we should pick it up. Sorry for not being explicit about this.

Adam

OK, if you're fine with that, cool.

The only reason I asked was that I wasn't sure whether it fundamentally doesn't fire on SPEC, or just doesn't fire with this particular cost model and might fire/behave differently with a different one.

We’re seeing nice improvements but also significant degradations on IA, which we would like to investigate before the patch is committed.

Major degradations we see:

networking
  ip_pktcheckb1m      -6.80%
  ip_pktcheckb2m      -6.74%
  ip_pktcheckb4m      -7.57%
  ip_pktcheckb512k    -6.58%

telecom
  autcor00data_1     -78.02%
  autcor00data_2     -76.80%
  autcor00data_3     -77.00%

(on Atom)

We are still working on creating reproducers.

In general we support this patch; we just want to have a chance to investigate the issues first. We need a few days for that.

I mean, OK… but keep in mind that Dehao’s original email went out over a week ago, so this patch has already been held up a while. As these benchmarks aren’t readily available, we also can’t do anything to help until a test case is posted.

Have you considered contributing these benchmarks to the LLVM test suite?

Hi,

We enabled "vectorizer-maximize-bandwidth" and ran SPEC CPU2006 (base, rate) on an 8-core Ryzen with 16 copies, using the configs below:

Base: -m64 -O3 -march=znver1 -mavx2

Base + VMB: -m64 -O3 -march=znver1 -mavx2 -mllvm -vectorizer-maximize-bandwidth

There's a small uplift for gcc and a small regression for sjeng. Everything else is within noise levels.

CPU2006 Results:

Does the regression seem acceptable to you? Have you done any analysis of what changed and why it regresses?

Hi Chandler,

We haven't analyzed the regression yet; we need some time to investigate.

In general this patch looks OK.

Regards,

Ashutosh

Attached is the first reproducer for the Atom/SLM degradation (~70%).

Chandler, those are part of the EEMBC benchmarks.

slm-no-vectorize.ll (2.05 KB)

Thanks for the testcase. Could you add some more details about the regression?

  • How do we build/run the testcase to reproduce the degradation? (A rough command sketch of what I would try is below.)
  • Does the degradation exist only on the Atom/SLM arch? If so, could you help analyze why maximizing the bandwidth makes performance worse there (we do not have these arches to run the perf tests on)?
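
As a starting point, absent more details, I would try something along these lines (sketch only; the exact pass and CPU flags are assumptions on my part, and the output file names are placeholders):

    # run just the loop vectorizer over the attached IR, with and without the flag
    opt -loop-vectorize -S -mcpu=slm slm-no-vectorize.ll -o default.ll
    opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mcpu=slm slm-no-vectorize.ll -o vmb.ll

    # compare the chosen vectorization factors / shuffles, then look at the codegen
    llc -mcpu=slm default.ll -o default.s
    llc -mcpu=slm vmb.ll -o vmb.s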

Thanks,
Dehao