LSR ComplexityLimit default value

Hi! I want to kindly attract your attention to this topic and propose to discuss updating ComplexityLimit constant in LoopStrengthReduce
(llvm-project/LoopStrengthReduce.cpp at main · llvm/llvm-project · GitHub)
from uint16_t::max to 512k. In our experiments on some benchmarks (e.g. lmBench lmbench/src at master · intel/lmbench · GitHub ) it shows good performance improvements for arm/aarch64 targets:


  • aarch64-linux-gnu:
Benchmark Improvement
lmbench/par_ops_double_div 4.93%
lmbench/par_ops_integer_bit 4.53%
lmbench/lat_mem_rd_l1 4.22%
lmbench/par_ops_int64_add 3.22%

  • arm-linux-android:
Benchmark Improvement
lmbench/par_ops_integer_bit 5.35%
lmbench/par_ops_double_div 4.19%
lmbench/par_ops_float_mul 4.09%

I and my team also observe improvements in our internal benchmarks.
And it is worth to mention that this change does not strongly affect compilation time: we see 3.65% (for LSR pass) in worst case.

What is your opinion: may default value of ComplexityLimit be changed (at least for arm/aarch64), or it is better to leave it as it is for saving compilation time?

So, how do you think, is it worth to reevaluate ComplexityLimit default value?

@sparker-arm maybe you have some commentaries about it?

Hi,

Thank you for sharing the experiment results!

I think the performance looks promising as the LSR pass is doing a
heuristic search on possible fix-ups. Do you also have data for
compile time increase when ComplexityLimit is increased?

We saw that compile time increased for ~3.65% (for LSR pass itself) in worst case.

Cool. Please allow me to have some time and experiment this on the RISC-V backend on SPECint.

Hi @kpdev42

I seem to remember I played with the limit because we aggressively unroll for Arm Cortex-M and LSR just wasn’t doing as well as it could with larger loops. I can’t remember the kind of gains we saw, but we stuck with a higher limit, though I can’t remember what, for our downstream toolchain.

I think it’s definitely worth exploring increasing the default.

Hi,

Experiment on the RISC-V backend with increased complexity limit shows worse performance than the original limit. I suspect there is still problems for IR transformed by LSR for RISC-V, causing some overwrites to make things worse.

The overall performance decreased by 0.19%. With sjeng decreasing the most with 0.40%. From the perspective of RISC-V backend developer, I would discourage the default change for now until we find better codegen in the backend.

Regards,

eop Chen

Interesting. Yes, looks like there are some pitfalls in LSR for RISC-V, maybe you checked (or guess) a root cause for it?