[RFC][SLP] Let's turn -slp-vectorize-hor on by default

I've done compile-time experiments for AArch64 over SPEC{2000,2006}
and, of course, the test-suite. I measured no significant compile-time
impact from enabling this feature by default.

I also ran the test-suite on an X86-64 machine. Having tested both
AArch64 and X86-64, I can't imagine any other targets being uniquely
affected in terms of compile time by turning this on. I also timed
running the regression tests with -slp-vectorize-hor enabled and
disabled; there was no significant difference there either.

There are no significant performance regressions (or improvements,
for that matter) on AArch64 in the nightly test-suite. I do see wins
in third-party benchmarks when using this flag, which is why I'm
asking whether there would be any objection from the community to
making -slp-vectorize-hor default to on.
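As a sketch of the kind of pattern this flag targets (an illustrative
example of mine, not code taken from the patch): a straight-line chain
of adds over adjacent loads, which horizontal-reduction detection can
turn into a vector load plus an across-vector add.

```cpp
// A straight-line "horizontal" reduction over adjacent loads.  With
// -slp-vectorize-hor, the SLP vectorizer can recognize this add tree
// and emit one vector load followed by a horizontal reduce (e.g.,
// ADDV on AArch64) instead of four scalar loads and three scalar adds.
int sum4(const int *a) {
  return a[0] + a[1] + a[2] + a[3];
}
```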

I have run the regression tests and looked through the bug tracker
and VC logs, and I can't see any reason not to enable it.

Thanks,
Charlie.

Have you run cpu2006 for x86-64 for perf progression/regression?

I have not. I could feasibly do this, but I'm not set up to perform
good experiments on X86-64 hardware. Furthermore, if I do it for
X86-64, it only seems fair that I should do it for the other backends
as well, which is much less feasible for me. I'm reaching out to the
community to see if there's any objection, based on their own
measurements of this feature, to defaulting it to on.

Please let me know if you think I've got the wrong end of the
etiquette stick here, and if so, I'll try to acquire sensible numbers
for other backends.

Kind regards,
Charlie.

Have you run cpu2006 for x86-64 for perf progression/regression?

I think it would be great if you could help Charlie with this.

+1

If there are no compile-time or runtime regressions, and we are seeing wins in some benchmarks, then we should enable this by default. At some point we should demote this flag from a command-line flag to a static variable in the code. Out of curiosity, how much of the compile time are we spending in the SLP vectorizer nowadays?
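For the record, a minimal sketch of what that demotion might look
like, assuming the option is defined with the usual cl::opt pattern
in SLPVectorizer.cpp (variable and description names here are
illustrative):

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Today: a hidden command-line flag, off by default.
static cl::opt<bool> ShouldVectorizeHor(
    "slp-vectorize-hor", cl::init(false), cl::Hidden,
    cl::desc("Attempt to vectorize horizontal reductions"));

// After demotion: the same knob as a plain constant, with no
// command-line surface; call sites keep reading the same name.
// static const bool ShouldVectorizeHor = true;
```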

+1

cheers,
--renato

I will try to get some SPEC CPU 2006 rate runs done under -O3 -flto, with and without -slp-vectorize-hor, and let you know.

-Thx

Out of curiosity, how much of the compile time are we spending in the SLP vectorizer nowadays?

My measurements were originally based on the "real time" reports from
/usr/bin/time (not the bash built-in), so I didn't have per-pass
statistics to hand. I did a quick experiment in which I compiled each
of the SPEC files with opt's -time-passes feature.

The "raw" numbers show that SLP can take anywhere from 0 to 30% of the
total optimization time. At the high end of that scale, things are a
bit fast and loose. Some of the biggest offenders are in rather small
bitcode files (where the total compile time is getting very small as
well)

The largest bitcode file[*] I had in SPEC2006 was about 1MiB. For that
particular example, SLP took less than 1% of the opt time.

For all bitcode files in SPEC2006 between 100KiB and 1MiB, SLP takes
less than 5% of compile time.

In tensor.bc (~80KiB) from SPEC2006, SLP took around 9.5% (±1%).
This was a borderline case of a compile-time impact from horizontal
reductions (about a 0.8% regression, so within the standard deviation).
There were actually swings the other way as well (i.e., SLP was slower
without horizontal-reduction detection), so it's hard to make any
judgment here.

Another pretty interesting one is fnpovfpu.bc (~40KiB), where SLP
took 17% of compile time.

Anyway, I hope that gives a rough impression of what's going on. I was
taking the wall-clock time measurement from -time-passes.

[*] In my haste, I initially screwed up by not reporting the overall
compile time, so as a proxy metric I went back and collected bitcode
file sizes, which saved me from having to rerun everything :confused:

Thanks for the detailed analysis, Charlie. We should probably look into fnpovfpu.bc and figure out what's going on there. Overall, I think the compile-time numbers are reasonable.

I will try to get some SPEC CPU 2006 rate runs done under -O3 -flto, with and without -slp-vectorize-hor, and let you know.

Do you have a time estimate for when you'll be able to get these
numbers? Another option would be to default the flag to on and revert
if it does cause regressions on the targets you're interested in.

TIA,
Charlie.

We have started this. Since there are some holidays, expect a small delay. We will let you know by Friday.

Thx

Hi Dibyendu,

Would it make sense for Charlie to flip this switch now? He can easily flip it back if you get negative benchmark results. Waiting three days for benchmarking seems a tad overkill for precommit :slight_smile:

James

Cool. No problem.

Thanks, I have enabled it by default in r252733.

--Charlie.