Enabling the SLP vectorizer by default for -O3

Hi,

LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below), I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and to give other people the opportunity to perform their own performance measurements.
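
To make this concrete, here is a hypothetical example (not taken from the test suite) of the kind of straight-line code the pass targets. The four statements are independent and isomorphic and touch consecutive memory, so the SLP vectorizer can pack them into vector operations, e.g. two <2 x double> additions under SSE:

  /* build with: clang -O3 -fslp-vectorize example.c */
  void foo(double *restrict a, const double *restrict b,
           const double *restrict c) {
    /* four independent, isomorphic statements on consecutive
       elements; candidates for packing into vector adds */
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
    a[2] = b[2] + c[2];
    a[3] = b[3] + c[3];
  }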

— Performance Gains —
SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
MultiSource/Benchmarks/Olden/power/power -18.55%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
SingleSource/Benchmarks/Misc/flops-6 -11.02%
SingleSource/Benchmarks/Misc/flops-5 -10.03%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%
External/Nurbs/nurbs -7.98%
SingleSource/Benchmarks/Misc/pi -7.29%
External/SPEC/CINT2000/252_eon/252_eon -5.78%
External/SPEC/CFP2006/444_namd/444_namd -4.52%
External/SPEC/CFP2000/188_ammp/188_ammp -4.45%
MultiSource/Applications/SIBsim4/SIBsim4 -3.58%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%
SingleSource/Benchmarks/Misc-C++/Large/sphereflake -2.96%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%
MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%
SingleSource/Benchmarks/Misc/flops -1.89%
SingleSource/Benchmarks/Misc/oourafft -1.71%
MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
External/SPEC/CFP2006/447_dealII/447_dealII -1.06%

— Regressions —
MultiSource/Benchmarks/Olden/bh/bh 22.47%
MultiSource/Benchmarks/Bullet/bullet 7.31%
SingleSource/Benchmarks/Misc-C++-EH/spirit 5.68%
SingleSource/Benchmarks/SmallPT/smallpt 3.91%

Thanks,
Nadav

Cool!

What changes have you seen to generated code size?

I’ll take it for a spin on our benchmarks.

> What changes have you seen to generated code size?

I did not measure code size.

> I’ll take it for a spin on our benchmarks.

Thanks!

> MultiSource/Benchmarks/Olden/bh/bh 22.47%
> MultiSource/Benchmarks/Bullet/bullet 7.31%

Those look like quite big regressions. Any idea why?

> Hi,
>
> LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below), I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and to give other people the opportunity to perform their own performance measurements.

This looks great, Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though? We should at least understand what is going wrong there. bh is pretty tiny, so it should be straightforward. It would also be really useful to see what the code size and compile time impact is.

-Chris

>> Hi,
>>
>> LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below), I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and to give other people the opportunity to perform their own performance measurements.
>
> This looks great, Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though?

Thanks. Yes, I looked at both. The hot function in bh is “gravsub”. The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with and without the VEX prefix. I suspected that the problem is the movupd instructions that load xmm0 and xmm1. I started looking at some performance counters on Friday, but I have not found anything suspicious yet.

+0x00  movupd  16(%rsi), %xmm0
+0x05  movupd  16(%rsp), %xmm1
+0x0b  subpd   %xmm1, %xmm0   <-- 18% of the runtime of bh?
+0x0f  movapd  %xmm0, %xmm2
+0x13  mulsd   %xmm2, %xmm2
+0x17  xorpd   %xmm1, %xmm1
+0x1b  addsd   %xmm2, %xmm1
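
For context, the source being vectorized has roughly this shape (a hypothetical paraphrase; the names are mine, not the Olden sources). The SLP vectorizer merges two of the subtractions into the movupd/movupd/subpd sequence above, while the squared-distance sum stays scalar (the mulsd/addsd tail):

  /* hypothetical paraphrase of gravsub's hot distance computation */
  static double distsq(const double *p0, const double *p1) {
    double dr0 = p1[0] - p0[0];  /* two of these become one subpd */
    double dr1 = p1[1] - p0[1];
    double dr2 = p1[2] - p0[2];
    return dr0 * dr0 + dr1 * dr1 + dr2 * dr2;  /* stays scalar */
  }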

I spent less time on Bullet. Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”). In this code the vectorizer generates several trees that use the <3 x float> type. This is risky because the loads/stores are inefficient, but unfortunately triples of RGB and XYZ values are very popular in some domains and we do want to vectorize them. I skimmed through the IR and the assembly and did not see anything too bad. The next step would be to do a binary search on the places where the vectorizer fires to locate the bad pattern.
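
For illustration, the triple pattern looks roughly like this (hypothetical code, not Bullet’s actual source). Three isomorphic statements on the x/y/z fields invite a <3 x float> tree, but a 3-wide load or store has no direct SSE encoding, so the backend must synthesize it from narrower accesses, which is where the inefficiency comes from:

  /* hypothetical example of the XYZ-triple pattern */
  typedef struct { float x, y, z; } vec3;

  void vec3_add(vec3 *r, const vec3 *a, const vec3 *b) {
    r->x = a->x + b->x;  /* three isomorphic lanes -> <3 x float> */
    r->y = a->y + b->y;
    r->z = a->z + b->z;
  }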

On AVX we have another regression that I did not mention: flops-7. When we vectorize, we cause more spills because we do a poor job of scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled.

I did not measure code size, but I did measure compile time. There are four or five workloads (not counting workloads that run for less than 0.5 seconds) where the compile time increase is more than 5%. I am aware of a problem in the (quadratic) code that looks for consecutive stores: it calls SCEV too many times. I plan to fix this.
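
To sketch the problem shape (illustrative pseudocode only, not the actual SLPVectorizer code): the search pairs every store against every other store and asks, via SCEV, whether their addresses differ by exactly the store size, so the number of SCEV queries grows quadratically with the number of stores in a block. The is_consecutive() predicate below is a hypothetical stand-in for that query:

  #include <stddef.h>

  typedef struct {           /* stand-in for a store instruction */
    const char *ptr;         /* address it stores to */
    size_t size;             /* bytes it stores */
  } store_t;

  /* hypothetical stand-in: does b begin exactly where a ends?
     the real code answers this with a ScalarEvolution query */
  static int is_consecutive(const store_t *a, const store_t *b) {
    return a->ptr + a->size == b->ptr;
  }

  static size_t count_consecutive_pairs(const store_t *stores, size_t n) {
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++)
      for (size_t j = 0; j < n; j++)   /* the quadratic part */
        if (i != j && is_consecutive(&stores[i], &stores[j]))
          pairs++;
    return pairs;
  }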

Thanks,
Nadav

It'll be a bit before I can go in and reduce it, but I thought I would mention that I've seen just one new crasher, and it's in part of GLU's reference implementation, libtess, in normal.c... No real details yet, but I mention it in case you're already aware of it or someone else knows how to build it...

Hi Nadav,

I think it’s a great idea to have the SLP vectorizer enabled, but maybe we should trim the horrible cases first (regressions, +5% compile time, etc.). I don’t mind a sub-5% compile-time increase at -O3, nor do I mind sub-1% performance regressions on some benchmarks, IFF the majority of the benchmarks improve.

Hi,

Sorry for the delay in responding. I measured the code size change and noticed small changes in both directions for individual programs. Overall I found a 30k binary size growth across the entire test suite + SPEC. I attached an updated performance report that includes both compile time and performance measurements.

report.pdf (52.3 KB)

> Hi,
>
> Sorry for the delay in responding. I measured the code size change
> and noticed small changes in both directions for individual
> programs. Overall I found a 30k binary size growth across the
> entire test suite + SPEC. I attached an updated performance report
> that includes both compile time and performance measurements.

I think that these numbers look good. Regarding the performance regressions:

This looks like noise:
MultiSource/Benchmarks/McCat/08-main/main 44.40% 0.0277 0.0400 0.0000

For these two:
MultiSource/Benchmarks/Olden/bh/bh 19.73% 1.1547 1.3825 0.0017
MultiSource/Benchmarks/Bullet/bullet 7.30% 3.6130 3.8767 0.0069
can you run them on a different CPU to see how generic these slowdowns are?

Thanks again,
Hal