Huge Slowdown/Regression in Affine Simplification after upgrading LLVM/MLIR - Lots of time spent in `ParametricStorageUniquer::getOrCreate`

We used to be on an October 2020 version of LLVM (b3b4cda). Recently we synced to a much more recent version from February 2022 (a7ac120) and noticed a massive slowdown when running the same code through the “Simplify Affine Structures” pass. In the most dramatic case, a workload that used to take seconds in the pass now takes four hours, making it unusable :frowning:

I created a small benchmark that can be run through both the old and the new version’s affine simplification pass: the old version takes 7 seconds, whereas the new one takes 28 seconds. (The IR is partially loop-unrolled with 50 affine loops left; the file size is 430 MB before simplification and 300 MB after.) Running both through perf, I noticed that 58% of the time in the new version is spent in `ParametricStorageUniquer::getOrCreate`, which seems suspicious. This behaviour does not show up in the old version.

New (slow) version’s profile:

I wonder what changed and how we can mitigate this. We feed very big loop-unrolled programs into affine simplification, so the IR sometimes has millions of lines… In this benchmark I used the same input for an apples-to-apples comparison (the only difference is `constant` vs `arith.constant`, which changed with the dialect upgrade). Funnily enough, the slow version was built with CMake RelWithDebInfo and the fast version in Debug mode. I’ll get data points with both built as RelWithDebInfo soon, but I would expect the old version to be even faster…

The old version’s profile seems much more sane:

You can’t realistically expect that somebody will pinpoint the cause of a slowdown in 1.5 years’ worth of commits…

getOrCreate creates new context-owned unique expressions under lock. There have been changes to the uniquer. There have been changes to the affine algorithms. The code seems to be creating a lot of affine expressions that need quite expensive canonicalization first. I would consider looking into how many expressions are created, and what simplifications happen during the process.
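To make the "unique expressions under lock" point concrete, here is a minimal sketch of the general get-or-create pattern with two counters attached. It is not MLIR's actual uniquer: `ToyUniquer`, its string keys, and the counter names are all hypothetical, chosen only to show how one could compare the number of lookups against the number of objects that actually get created.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Toy stand-in for a context-owned uniquer. All names are hypothetical;
// this is NOT MLIR's StorageUniquer implementation.
class ToyUniquer {
public:
  // Returns the unique instance for `key`, creating it on first request.
  // Every call takes the lock, including lookups that hit an existing entry.
  const std::string *getOrCreate(const std::string &key) {
    std::lock_guard<std::mutex> guard(mutex_);
    ++lookups_;
    auto it = storage_.find(key);
    if (it != storage_.end())
      return it->second.get();
    ++creations_;
    auto inserted = storage_.emplace(key, std::make_unique<std::string>(key));
    assert(inserted.second && "key was checked to be absent above");
    return inserted.first->second.get();
  }

  // Total number of getOrCreate calls so far.
  std::size_t lookups() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return lookups_;
  }

  // Number of calls that had to allocate a new instance.
  std::size_t creations() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return creations_;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::unique_ptr<std::string>> storage_;
  std::size_t lookups_ = 0;
  std::size_t creations_ = 0;
};
```

If the lookup count grew a lot between the two versions while the creation count stayed flat, that would point at callers requesting the same expressions more often rather than at the uniquer itself getting slower.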

Unless I’m missing something, we can’t tell if this is caused by the number of executions of code or the duration of an individual execution of some code. Unsurprisingly, a lot might have changed in the last 1.5 years both in regards to ordering of transformations and what those transformations do. The ordering might matter as the order of transformations sometimes results in more work being done. The transformations might matter as they decide how much time is taken in each unit of work.
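One way to separate "called more often" from "each call got slower" is to record both a call count and the accumulated time for the code of interest. The sketch below is a generic instrumentation helper, not an MLIR facility: `CallStats` and `ScopedCallTimer` are hypothetical names, and the helper is not thread-safe, so it assumes a single-threaded run.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulates how often a region ran and how long it took in total.
// Hypothetical helper; not thread-safe.
struct CallStats {
  std::uint64_t calls = 0;
  std::chrono::nanoseconds total{0};

  void record(std::chrono::nanoseconds elapsed) {
    assert(elapsed.count() >= 0 && "steady_clock never goes backwards");
    ++calls;
    total += elapsed;
  }

  // Mean time per call in nanoseconds, or 0 if nothing was recorded.
  double meanNanos() const {
    return calls ? static_cast<double>(total.count()) / calls : 0.0;
  }
};

// RAII timer: measures from construction to destruction and records the
// elapsed time into the given CallStats.
class ScopedCallTimer {
public:
  explicit ScopedCallTimer(CallStats &stats)
      : stats_(stats), start_(std::chrono::steady_clock::now()) {}

  ~ScopedCallTimer() {
    auto end = std::chrono::steady_clock::now();
    stats_.record(
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start_));
  }

  ScopedCallTimer(const ScopedCallTimer &) = delete;
  ScopedCallTimer &operator=(const ScopedCallTimer &) = delete;

private:
  CallStats &stats_;
  std::chrono::steady_clock::time_point start_;
};
```

Placing a `ScopedCallTimer` at the top of the suspect function in both builds and comparing `calls` and `meanNanos()` would show which of the two changed. For very short functions the timer overhead itself can be significant, so the call count is likely the more trustworthy of the two numbers.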

I suspect the cause will be related to the transformations being run doing more, but I would recommend also trying to profile how many times patterns were applied to confirm that is not the cause.

Can you try disabling threading (in both the before and after case) and see how it compares? (`--mlir-disable-threading` on the command line, I believe)