Enabling the SLP vectorizer by default for -O3

Hi,

LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below), I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and to give other people the opportunity to perform their own performance measurements.
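
To make this concrete, here is a hypothetical example (not taken from the test suite) of the kind of straight-line code the pass targets. The four statements are independent and isomorphic and touch consecutive memory, so the SLP vectorizer can pack them into vector operations, e.g. two <2 x double> additions under SSE:

  /* build with: clang -O3 -fslp-vectorize example.c */
  void foo(double *restrict a, const double *restrict b,
           const double *restrict c) {
    /* four independent, isomorphic statements on consecutive
       elements; candidates for packing into vector adds */
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
    a[2] = b[2] + c[2];
    a[3] = b[3] + c[3];
  }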

— Performance Gains —
SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
MultiSource/Benchmarks/Olden/power/power -18.55%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
SingleSource/Benchmarks/Misc/flops-6 -11.02%
SingleSource/Benchmarks/Misc/flops-5 -10.03%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%
External/Nurbs/nurbs -7.98%
SingleSource/Benchmarks/Misc/pi -7.29%
External/SPEC/CINT2000/252_eon/252_eon -5.78%
External/SPEC/CFP2006/444_namd/444_namd -4.52%
External/SPEC/CFP2000/188_ammp/188_ammp -4.45%
MultiSource/Applications/SIBsim4/SIBsim4 -3.58%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%
SingleSource/Benchmarks/Misc-C++/Large/sphereflake -2.96%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%
MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%
SingleSource/Benchmarks/Misc/flops -1.89%
SingleSource/Benchmarks/Misc/oourafft -1.71%
MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
External/SPEC/CFP2006/447_dealII/447_dealII -1.06%

— Regressions —
MultiSource/Benchmarks/Olden/bh/bh 22.47%
MultiSource/Benchmarks/Bullet/bullet 7.31%
SingleSource/Benchmarks/Misc-C++-EH/spirit 5.68%
SingleSource/Benchmarks/SmallPT/smallpt 3.91%

Thanks,
Nadav

Cool!

What changes have you seen to generated code size?

I’ll take it for a spin on our benchmarks.

> What changes have you seen to generated code size?

I did not measure code size.

> I’ll take it for a spin on our benchmarks.

Thanks!

> MultiSource/Benchmarks/Olden/bh/bh 22.47%
> MultiSource/Benchmarks/Bullet/bullet 7.31%

Those look like quite big regressions. Any idea why?

> Hi,
>
> LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below), I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and to give other people the opportunity to perform their own performance measurements.

This looks great, Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though? We should at least understand what is going wrong there. bh is pretty tiny, so it should be straightforward. It would also be really useful to see what the code size and compile time impact is.

-Chris

>> Hi,
>>
>> LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandy Bridge Mac (using SSE4, without AVX). Based on my performance measurements (below), I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and to give other people the opportunity to perform their own performance measurements.
>
> This looks great, Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though?

Thanks. Yes, I looked at both. The hot function in bh is “gravsub”. The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with and without the VEX prefix. I suspected that the problem is the movupd instructions that load xmm0 and xmm1. I started looking at some performance counters on Friday, but I have not found anything suspicious yet.

+0x00  movupd  16(%rsi), %xmm0
+0x05  movupd  16(%rsp), %xmm1
+0x0b  subpd   %xmm1, %xmm0   <-- 18% of the runtime of bh?
+0x0f  movapd  %xmm0, %xmm2
+0x13  mulsd   %xmm2, %xmm2
+0x17  xorpd   %xmm1, %xmm1
+0x1b  addsd   %xmm2, %xmm1
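
For context, the source being vectorized has roughly this shape (a hypothetical paraphrase; the names are mine, not the Olden sources). The SLP vectorizer merges two of the subtractions into the movupd/movupd/subpd sequence above, while the squared-distance sum stays scalar (the mulsd/addsd tail):

  /* hypothetical paraphrase of gravsub's hot distance computation */
  static double distsq(const double *p0, const double *p1) {
    double dr0 = p1[0] - p0[0];  /* two of these become one subpd */
    double dr1 = p1[1] - p0[1];
    double dr2 = p1[2] - p0[2];
    return dr0 * dr0 + dr1 * dr1 + dr2 * dr2;  /* stays scalar */
  }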

I spent less time on Bullet. Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”). In this code the vectorizer generates several trees that use the <3 x float> type. This is risky because the loads/stores are inefficient, but unfortunately triples of RGB and XYZ values are very popular in some domains and we do want to vectorize them. I skimmed through the IR and the assembly and did not see anything too bad. The next step would be to do a binary search on the places where the vectorizer fires to locate the bad pattern.
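
For illustration, the triple pattern looks roughly like this (hypothetical code, not Bullet’s actual source). Three isomorphic statements on the x/y/z fields invite a <3 x float> tree, but a 3-wide load or store has no direct SSE encoding, so the backend must synthesize it from narrower accesses, which is where the inefficiency comes from:

  /* hypothetical example of the XYZ-triple pattern */
  typedef struct { float x, y, z; } vec3;

  void vec3_add(vec3 *r, const vec3 *a, const vec3 *b) {
    r->x = a->x + b->x;  /* three isomorphic lanes -> <3 x float> */
    r->y = a->y + b->y;
    r->z = a->z + b->z;
  }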

On AVX we have another regression that I did not mention: flops-7. When we vectorize, we cause more spills because we do a poor job of scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled.

I did not measure code size, but I did measure compile time. There are four or five workloads (not counting workloads that run for less than 0.5 seconds) where the compile time increase is more than 5%. I am aware of a problem in the (quadratic) code that looks for consecutive stores: it calls SCEV too many times. I plan to fix this.
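
To sketch the problem shape (illustrative pseudocode only, not the actual SLPVectorizer code): the search pairs every store against every other store and asks, via SCEV, whether their addresses differ by exactly the store size, so the number of SCEV queries grows quadratically with the number of stores in a block. The is_consecutive() predicate below is a hypothetical stand-in for that query:

  #include <stddef.h>

  typedef struct {           /* stand-in for a store instruction */
    const char *ptr;         /* address it stores to */
    size_t size;             /* bytes it stores */
  } store_t;

  /* hypothetical stand-in: does b begin exactly where a ends?
     the real code answers this with a ScalarEvolution query */
  static int is_consecutive(const store_t *a, const store_t *b) {
    return a->ptr + a->size == b->ptr;
  }

  static size_t count_consecutive_pairs(const store_t *stores, size_t n) {
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++)
      for (size_t j = 0; j < n; j++)   /* the quadratic part */
        if (i != j && is_consecutive(&stores[i], &stores[j]))
          pairs++;
    return pairs;
  }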

Thanks,
Nadav

It'll be a bit before I can go in and reduce it, but I thought I would mention that I've seen just one new crasher, and it's in part of GLU's reference implementation, libtess, in normal.c... No real details yet, but I mention it in case you're already aware of it or someone else knows how to build it...

Hi Nadav,

I think it’s a great idea to have the SLP vectorizer enabled, but maybe we should trim the horrible cases first (regressions, +5% compile time, etc.). I don’t mind a sub-5% compile-time increase at -O3, nor do I mind sub-1% performance regressions on some benchmarks, IFF the majority of the benchmarks improve.

Hi,

Sorry for the delay in responding. I measured the code size change and noticed small changes in both directions for individual programs. Overall I found a 30k binary size growth across the entire test suite + SPEC. I attached an updated performance report that includes both compile time and performance measurements.

report.pdf (52.3 KB)

> Hi,
>
> Sorry for the delay in responding. I measured the code size change
> and noticed small changes in both directions for individual
> programs. Overall I found a 30k binary size growth across the
> entire test suite + SPEC. I attached an updated performance report
> that includes both compile time and performance measurements.

I think that these numbers look good. Regarding the performance regressions:

This looks like noise:
MultiSource/Benchmarks/McCat/08-main/main 44.40% 0.0277 0.0400 0.0000

For these two:
MultiSource/Benchmarks/Olden/bh/bh 19.73% 1.1547 1.3825 0.0017
MultiSource/Benchmarks/Bullet/bullet 7.30% 3.6130 3.8767 0.0069
can you run them on a different CPU to see how generic these slowdowns are?

Thanks again,
Hal