Enabling the Loop Distribution pass by default in the new pass manager pipeline

Hi All,

My colleague Sanne found a performance improvement with the ‘-enable-loop-distribute’ option on hmmer from SPEC2006.

In hmmer, there is a loop with a dependence. The Loop Distribute pass splits the loop into three separate loops: one still has a dependence, another is vectorizable, and the third is vectorizable after running the LoopBoundSplit pass, which needs to be updated a bit. On AArch64, we have seen a 40% improvement on hmmer from SPEC2006 with the Loop Distribute pass enabled, and an 80% improvement with both the Loop Distribute pass and LoopBoundSplit enabled.
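To illustrate the kind of transformation involved, here is a simplified sketch (not the actual hmmer source; the function and array names are made up) of a loop where only one statement carries a dependence, so distribution exposes the vectorizable parts:

```c
/* Simplified, hypothetical sketch; not the hmmer kernel. Only the first
   statement carries a dependence across iterations.                     */
void before(int n, int *restrict a, const int *restrict b,
            int *restrict c, int *restrict d) {
  for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i]; /* loop-carried dependence           */
    c[i] = b[i] * 2;        /* independent of other iterations   */
    d[i] = c[i] + b[i];     /* uses only values from iteration i */
  }
}

/* Conceptually, after distribution: the first loop stays scalar,
   the other two can be vectorized on their own.                  */
void after(int n, int *restrict a, const int *restrict b,
           int *restrict c, int *restrict d) {
  for (int i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i];
  for (int i = 1; i < n; i++)
    c[i] = b[i] * 2;
  for (int i = 1; i < n; i++)
    d[i] = c[i] + b[i];
}
```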

On llvm-test-suite and the SPEC benchmarks, I have not seen any performance degradation from enabling the Loop Distribute pass, because almost all tests are not handled by the pass; it mainly reports the messages below, which I think are reasonable.

Skipping; memory operations are safe for vectorization

Skipping; no unsafe dependences to isolate

Skipping; multiple exit blocks

For compile time, there is no big change, because almost all tests are rejected early by the pass for the three reasons above, using cached analysis information.

At the moment, we can enable the pass with loop metadata or a command-line option. If possible, can we enable the Loop Distribute pass by default in the new pass manager pipeline?
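For reference, here is roughly how the pass can be requested today (a minimal sketch; the file and function names are hypothetical):

```c
/* dist_example.c (hypothetical)
 *
 * Per-loop: the pragma attaches llvm.loop.distribute.enable metadata to
 * the loop that follows.
 *
 * Globally: pass the internal option through clang, e.g.
 *   clang -O3 -mllvm -enable-loop-distribute dist_example.c -c
 */
void dist_example(int n, int *restrict a, const int *restrict b,
                  int *restrict c) {
#pragma clang loop distribute(enable)
  for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i]; /* part with a loop-carried dependence */
    c[i] = b[i] * b[i];     /* independent, vectorizable part      */
  }
}
```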

Thanks

JinGu Kang

My 2 cents:

It’s not really convincing if a pass triggers on only 1 benchmark case. But on the other hand, if it is a really cheap pass to run (compile-times) and it benefits a case, then why not? Perhaps you need to quantify this to make it more convincing. An additional benefit of enabling it by default is that it gets more exposure and testing, which I think is a good thing.

Lastly, is there anything we can learn from GCC here? E.g., do they have this enabled, and perhaps support more/other cases?

I’d be in favour of enabling loop distribution by default as long as it doesn’t hurt compile-time when it’s not needed.

FWIW GCC enables this by default to get the speedup on hmmer. I don’t know enough about the LLVM implementation to compare it with GCC’s, but GCC’s loop distribution pass aims to help vectorisation and to detect manual memset/memcpy implementations (I think LLVM does that detection in another pass).

You can read the high-level GCC design in the source: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/tree-loop-distribution.c;h=65aa1df4abae2c6acf40299f710bc62ee6bacc07;hb=HEAD#l39
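To illustrate the kind of pattern those comments describe (a sketch only; I haven’t checked what GCC emits for this exact function):

```c
/* Sketch: distributing this loop isolates the zeroing statement into its
   own loop, which can then be recognised as a memset; the copy statement
   similarly becomes a memcpy-like loop.                                  */
void clear_and_copy(int n, int *restrict a, int *restrict b,
                    const int *restrict src) {
  for (int i = 0; i < n; i++) {
    a[i] = 0;
    b[i] = src[i];
  }
}
```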

Thanks,

Kyrill

The LoopDistribute pass doesn't do anything unless it sees llvm.loop.distribute.enable (`#pragma clang loop distribute(enable)`), because it does not have a profitability heuristic: it cannot say whether loop distribution is good for performance or not. What makes it improve hmmer is that the distributed loops can be vectorized. However, LoopDistribute is located before the vectorizer and cannot say in advance whether a distributed loop will be vectorized or not. If not, then it has potentially only increased loop overhead.

To make -enable-loop-distribute on by default would mean that we could consider loop distribution to be usually beneficial without causing major regressions. We need a lot more data to support that conclusion.

Alternatively, we could consider loop-distribution a canonicalization. A later LoopFuse would do the profitability heuristic to re-fuse loops again if loop distribution did not gain anything.

Michael

I appreciate your replies. I have collected the performance data below.

For AArch64, the performance data from llvm-test-suite is below.

Metric: exec_time

Program results_base results_loop_dist diff
test-suite...ications/JM/lencod/lencod.test 3.95 4.29 8.8%
test-suite...emCmp<5, GreaterThanZero, Mid> 1456.09 1574.29 8.1%
test-suite...st:BM_BAND_LIN_EQ_LAMBDA/44217 22.83 24.50 7.3%
test-suite....test:BM_BAND_LIN_EQ_RAW/44217 23.00 24.17 5.1%
test-suite...st:BM_INT_PREDICT_LAMBDA/44217 589.54 616.70 4.6%
test-suite...t:BENCHMARK_asin_novec_double_ 330.25 342.17 3.6%
test-suite...ow-dbl/GlobalDataFlow-dbl.test 2.58 2.67 3.3%
test-suite...da.test:BM_PIC_2D_LAMBDA/44217 781.30 806.36 3.2%
test-suite...est:BM_ENERGY_CALC_LAMBDA/5001 63.02 64.93 3.0%
test-suite...gebra/kernels/syr2k/syr2k.test 6.53 6.73 3.0%
test-suite...t/StatementReordering-flt.test 2.33 2.40 2.8%
test-suite...sCRaw.test:BM_PIC_2D_RAW/44217 789.90 810.05 2.6%
test-suite...s/gramschmidt/gramschmidt.test 1.44 1.48 2.5%
test-suite...Raw.test:BM_HYDRO_1D_RAW/44217 38.42 39.37 2.5%
test-suite....test:BM_INT_PREDICT_RAW/44217 597.73 612.34 2.4%
Geomean difference -0.0%
        results_base results_loop_dist diff
count 584.000000 584.000000 584.000000
mean 2761.681991 2759.451499 -0.000020
std 30145.555650 30124.858004 0.011093
min 0.608782 0.608729 -0.116286
25% 3.125425 3.106625 -0.000461
50% 130.212207 130.582658 0.000004
75% 602.708659 612.931769 0.000438
max 511340.880000 511059.980000 0.087630

For AArch64, the performance data from the SPEC benchmarks is below.

SPEC2006
Benchmark Improvement(%)
400.perlbench -1.786911228
401.bzip2 -3.174199894
403.gcc 0.717990522
429.mcf 2.053027806
445.gobmk 0.775388165
456.hmmer 43.39308377
458.sjeng 0.133933093
462.libquantum 4.647923489
464.h264ref -0.059568786
471.omnetpp 1.352515266
473.astar 0.362752409
483.xalancbmk 0.746580249
    
SPEC2017
Benchmark Improvement(%)
500.perlbench_r 0.415424516
502.gcc_r -0.112915812
505.mcf_r 0.238633706
520.omnetpp_r 0.114830748
523.xalancbmk_r 0.460107636
525.x264_r -0.401915964
531.deepsjeng_r 0.010064227
541.leela_r 0.394797504
557.xz_r 0.111781366

Thanks
JinGu Kang

Regarding treating the LoopDistribute pass as a canonicalization, with the LoopFuse pass providing the profitability heuristic: it looks like the LoopFuse pass does not have a proper profitability function either.

If possible, I would like to enable the LoopDistribute pass by default based on the performance data.

As you can see in the previous email, the Geomean difference on llvm-test-suite is -0.0%. On the SPEC benchmarks, we see a 43% performance improvement on 456.hmmer from SPEC2006. Based on this data, I think we could say the pass is usually beneficial without causing major regressions.

What do you think?

Thanks
JinGu Kang

> Based on this data, I think we could say the pass is usually beneficial without causing major regressions.

I think we need to look at compile times too before we can draw that conclusion, i.e. we need to justify that it’s worth spending extra compile time on optimising a few cases. Hopefully loop distribution is a cheap pass to run (also when it is running but not triggering), but that’s something that needs to be checked, I think.

The compile-time data is below. There could be a bit of noise, but it looks like there is no big compile-time regression.

From llvm-test-suite

Metric: compile_time

Program results_base results_loop_dist diff
test-suite…arks/VersaBench/dbms/dbms.test 0.94 0.95 1.6%
test-suite…s/MallocBench/cfrac/cfrac.test 0.89 0.90 1.5%
test-suite…ks/Prolangs-C/gnugo/gnugo.test 0.72 0.73 1.4%
test-suite…yApps-C++/PENNANT/PENNANT.test 8.65 8.75 1.2%
test-suite…marks/Ptrdist/yacr2/yacr2.test 0.84 0.85 1.1%
test-suite…/Builtins/Int128/Builtins.test 0.86 0.87 1.0%
test-suite…s/ASC_Sequoia/AMGmk/AMGmk.test 0.69 0.70 1.0%
test-suite…decode/alacconvert-decode.test 1.16 1.17 0.9%
test-suite…encode/alacconvert-encode.test 1.16 1.17 0.9%
test-suite…peg2/mpeg2dec/mpeg2decode.test 1.71 1.72 0.9%
test-suite…/Applications/spiff/spiff.test 0.88 0.89 0.9%
test-suite…terpolation/Interpolation.test 0.96 0.97 0.9%
test-suite…chmarks/MallocBench/gs/gs.test 4.58 4.62 0.9%
test-suite…-C++/stepanov_abstraction.test 0.69 0.70 0.8%
test-suite…marks/7zip/7zip-benchmark.test 52.35 52.74 0.7%
Geomean difference nan%
        results_base results_loop_dist diff
count 117.000000 118.000000 117.000000
mean 4.636126 4.616575 0.002171
std 7.725991 7.737663 0.006310
min 0.607300 0.602200 -0.041930
25% 1.345700 1.313650 -0.001577
50% 1.887000 1.888800 0.002463
75% 4.340800 4.343275 0.005754
max 52.351200 52.736000 0.015861

From SPEC2017

> I think we need to look at compile times too before we can draw that conclusion, i.e. we need to justify that it's worth spending extra compile time on optimising a few cases. Hopefully loop distribution is a cheap pass to run (also when it is running but not triggering), but that's something that needs to be checked, I think.

LoopDistribute currently already iterates over all loops to find the llvm.loop.distribute.enable metadata. The additional compile-time overhead would come from LoopAccessAnalysis, which could be cheap if LoopAccessAnalysis is used for LoopVectorize anyway.

For some reason I cannot find this email in my inbox, although it was definitely sent to the mailing list: https://lists.llvm.org/pipermail/llvm-dev/2021-June/151306.html. So I am replying within Sjoerd's email.

> Regarding treating the LoopDistribute pass as a canonicalization, with the LoopFuse pass providing the profitability heuristic: it looks like the LoopFuse pass does not have a proper profitability function either.

Within the loop optimization working group we were considering adding a heuristic to LoopFuse, which is also not restricted to innermost loops. However, the advantage is that it could run after LoopVectorize and re-fuse loops that turned out to be non-vectorizable, or loops that have been vectorized independently. Unfortunately, I think the legality/profitability analysis is comparatively expensive since it does not use LoopAccessAnalysis.

Michael

[adding nikic to CC]

@nikic Would you consider this amount of regression acceptable?

@nikic If you need more information about the loop distribute pass, please let me know.

Thanks

JinGu Kang

Sorry for the ping.

As I mentioned in the previous email, if you need more information about enabling the loop distribute pass, please let me know. @Michael @nikic

Regards

JinGu Kang

Hi,

> Do you have any data on how often LoopDistribute triggers on a larger set of programs (like llvm-test-suite + SPEC)? AFAIK the implementation is very limited at the moment (geared towards catching the case in hmmer) and I suspect lack of generality is one of the reasons why it is not enabled by default yet.

> It would be good to have some fresh numbers on how often LoopDistribute triggers. From what I remember, there are a handful of cases in the test suite, but nothing that significantly affects performance (other than hmmer, obviously).

> Also, there’s been an effort to improve the cost-modeling for LoopDistribute (https://reviews.llvm.org/D100381). Should we make progress in that direction first, before enabling by default?

Unfortunately, there were some problems with this effort. First, the current implementation of LoopDistribute relies heavily on LoopAccessAnalysis, which made it difficult to adapt.

More importantly though, I’m not convinced that LoopDistribute will be beneficial other than in cases where it enables more vectorization. (The memcpy detection in GCC might be interesting; I didn’t look at that.) It reduces both ILP and MLP, which in some cases might be made up for by lower register or cache pressure, but this is hard or impossible for the compiler to know.
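As an illustration of what I mean (a made-up example, not taken from any benchmark), a loop with two independent streams is already vectorizable when fused; distributing it only makes the code walk memory in two separate passes:

```c
/* Illustrative only: fused, the loads from b[] and c[] can be in flight
   at the same time and the loop already vectorizes; distributed, the two
   resulting loops traverse memory separately, with less ILP and MLP.     */
void independent_streams(int n, int *restrict x, int *restrict y,
                         const int *restrict b, const int *restrict c) {
  for (int i = 0; i < n; i++) {
    x[i] = b[i] + 1;
    y[i] = c[i] + 1;
  }
}
```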

While working on this, with a more aggressive LoopDistribute across several benchmarks, I did not see any improvements that didn’t turn out to be noise, and plenty of cases where it was actively degrading performance.

Therefore, I’m not sure this direction is worth pursuing further, and I believe the current heuristic of “distribute when it enables new vectorization” is actually pretty reasonable, if not very general.

Cheers,
Sanne

Ping.

Additionally, I was not able to see the pass trigger on llvm-test-suite and the SPEC benchmarks, except on hmmer.

Thanks

JinGu Kang

Hi,

> Do you have any data on how often LoopDistribute triggers on a larger set of programs (like llvm-test-suite + SPEC)? AFAIK the implementation is very limited at the moment (geared towards catching the case in hmmer) and I suspect lack of generality is one of the reasons why it is not enabled by default yet.

> It would be good to have some fresh numbers on how often LoopDistribute triggers. From what I remember, there are a handful of cases in the test suite, but nothing that significantly affects performance (other than hmmer, obviously).

> Also, there’s been an effort to improve the cost-modeling for LoopDistribute (https://reviews.llvm.org/D100381). Should we make progress in that direction first, before enabling by default?

> Unfortunately, there were some problems with this effort. First, the current implementation of LoopDistribute relies heavily on LoopAccessAnalysis, which made it difficult to adapt.

> More importantly though, I’m not convinced that LoopDistribute will be beneficial other than in cases where it enables more vectorization. (The memcpy detection in GCC might be interesting; I didn’t look at that.) It reduces both ILP and MLP, which in some cases might be made up for by lower register or cache pressure, but this is hard or impossible for the compiler to know.

I think we should be able to make an educated guess at least if we wanted to, although it won’t be straightforward. I think there can be cases where loop distribution can be beneficial on its own, especially for large loops where enough parallelism remains after distributing, but they can be highly target-specific.

> While working on this, with a more aggressive LoopDistribute across several benchmarks, I did not see any improvements that didn’t turn out to be noise, and plenty of cases where it was actively degrading performance.

Thanks for the update! It might be good to close the loop on the review as well?

Cheers,
Florian

Hi Florian,

Thanks for your kind reply.

On almost all tests, the loop distribute pass was not triggered, reporting one of the messages below:

Skipping; memory operations are safe for vectorization

Skipping; no unsafe dependences to isolate

Skipping; multiple exit blocks

It looks like the first and second messages are reasonable.

For the third message, we could try to improve the pass to handle loops with multiple exit blocks… I am not sure how much effort we would need for that…
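For example, a hypothetical loop like the one below has an extra exit out of the loop body, so it currently hits that bailout:

```c
/* Hypothetical example: the early return creates a second exit block,
   so the current LoopDistribute skips the loop.                       */
int find_negative(int n, const int *restrict a, int *restrict sum) {
  for (int i = 0; i < n; i++) {
    if (a[i] < 0)
      return i; /* extra exit */
    *sum += a[i];
  }
  return -1;
}
```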

Thanks

JinGu Kang

I find that LoopDistribute is not activated even with `#pragma clang loop distribute(enable)`: Compiler Explorer