Loop unrolling in large functions and compile time

wolfy1961 · December 13, 2022, 1:57am

Recently we came across a customer application that suffered from a 100% compile time degradation when built with LLVM 13 (with LTO) vs. LLVM 12. It turns out that the slowdown is caused by loop unrolling, and more specifically by the dominator tree updates performed by the loop unrolling pass.

The loop unrolling pass is invoked many times, but the instances that take a large amount of time occur with extremely large functions. I’m not sure about the relationship between the function size and the time Dom tree updating takes, but it appears to be non-linear.

I was wondering if anyone had any insights on how best to throttle loop unrolling in large functions, or whether that is even the best way to control compile time. Using #pragma clang loop unroll(disable) would be an obvious suggestion for the user, but it would be interesting to consider an option that prevents unrolling once a certain threshold wrt function size is crossed.
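For reference, a minimal sketch of the per-loop pragma mentioned above. The function name and loop body are hypothetical and only illustrate where the pragma goes. Clang records the hint as loop metadata in the IR, so it should carry through to the unroller; other compilers may ignore the pragma with a warning.

```cpp
#include <cstddef>

// Hypothetical function; the name and body are illustrative only.
int sum_small_block(const int *data) {
  int total = 0;
  // Ask Clang not to unroll this particular loop.
#pragma clang loop unroll(disable)
  for (std::size_t i = 0; i < 8; ++i)
    total += data[i];
  return total;
}
```

The pragma applies only to the loop that immediately follows it, so in a function with many loops each one would have to be annotated separately.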

A warning at compile time would also help, so the user can more easily diagnose which function they should pay attention to.

rengolin · December 13, 2022, 10:10am

At the very least this looks like a serious regression between versions. Can you check clang 14 to see if the issue remains?

I’m pinging some people that seem to have worked on the dominator tree analysis around 2020/2021 to see if anything rings a bell. By your description, this sounds like a change in complexity of some analysis, or perhaps just a side effect of a more aggressive unrolling making the already existing complex analysis go much slower.

@preames @serge-sans-paille @compnerd @nikic @jyknight

A smaller reproducer would also help a lot in investigating the problem. If you can create an issue on GitHub with a small example of what triggers the slowdown, that’d help a lot.

nikic · December 13, 2022, 10:26am

Based on the time-frame and the issue being domtree updates in particular, it’s possible that D103561 ([LoopUnroll] Reorder code to max dom tree update more obvious [nfc]) is related. We switched more parts of unrolling to use DTU. It’s possible that this hits the cutoff and ends up recomputing the DT for the whole function.

Though I’d still find it odd that DT updates by the unroller have a significant impact on end-to-end compile-time, unless we are unrolling many small loops inside a huge function and each one recomputes the DT, or something like that.

(Based on your description, I’m assuming this is not a case where unrolling itself produces a lot of code.)
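To make the scaling argument concrete, here is a toy cost model. It is not LLVM code and does not reflect the real DTU heuristics. It assumes that every unrolled loop forces one full dominator tree recomputation, and that one recomputation costs roughly one unit of work per basic block in the function. Both function names are made up for illustration.

```cpp
#include <cstdint>

// Toy model, not LLVM code. Assumption: each unrolled loop triggers a full
// dominator tree recomputation costing about one unit per basic block.
std::uint64_t toy_full_recompute_cost(std::uint64_t num_blocks,
                                      std::uint64_t num_unrolled_loops) {
  return num_blocks * num_unrolled_loops;
}

// Same toy model, but assuming each update only touches the blocks of the
// loop being unrolled.
std::uint64_t toy_incremental_cost(std::uint64_t blocks_per_loop,
                                   std::uint64_t num_unrolled_loops) {
  return blocks_per_loop * num_unrolled_loops;
}
```

Under these assumptions, a function with 100,000 blocks and 10,000 small loops of 10 blocks each costs 1,000,000,000 units with full recomputation versus 100,000 units with purely local updates. If the number of loops grows with the function size, the full-recompute cost grows roughly quadratically, which would be consistent with the non-linear behaviour described in the first post.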

fhahn · December 13, 2022, 11:34am

I think we saw a similar regression, but it got fixed since then. @wolfy1961 Please check with 15.0/current main.

If it still reproduces, a reproducer would be helpful.

wolfy1961 · December 13, 2022, 6:53pm

nikic wrote: “unless we are unrolling many small loops inside a huge function and each one recomputes the DT, or something like that.”

Yeah, that’s the scenario I’m seeing.
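As a rough illustration of that shape (many small constant-trip loops inside a single function), here is a hypothetical sketch. It is not the customer code and has not been verified to trigger the slowdown; the macro and function names are made up, and a real reproducer would likely need far more loops and less trivially foldable bodies.

```cpp
// Hypothetical shape only: one function containing many small loops.
#define SMALL_LOOP(acc, base)   \
  for (int i = 0; i < 4; ++i) { \
    (acc) += (base) + i;        \
  }

#define REPEAT_4(X) X X X X
#define REPEAT_16(X) REPEAT_4(X) REPEAT_4(X) REPEAT_4(X) REPEAT_4(X)
#define REPEAT_64(X) REPEAT_16(X) REPEAT_16(X) REPEAT_16(X) REPEAT_16(X)

// Expands to 64 back-to-back loops, each with a trip count of 4.
int many_small_loops(int seed) {
  int total = 0;
  REPEAT_64(SMALL_LOOP(total, seed))
  return total;
}
```

Scaling the repeat count up is one way to check whether compile time grows faster than linearly with the number of loops in a function.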

wolfy1961 · December 13, 2022, 6:55pm

fhahn wrote: “I think we saw a similar regression, but it got fixed since then. @wolfy1961 Please check with 15.0/current main. If it still reproduces, a reproducer would be helpful.”

Will do.

wolfy1961 · December 15, 2022, 1:38am

14.0 still shows the problem but 15.0 and current main indeed no longer do.

Thanks everyone for the input. I’ll look for the exact commit that fixed it, unless someone here already knows…
