[RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

Hi,

Over the past six months, a MemorySSA-backed DSE implementation has been added to LLVM. It now covers almost all cases the existing DSE implementation handles, and adds a major new capability: eliminating stores across basic blocks. Thanks to everyone involved with reviews, testing & patches!

I think now would be a good time to start working towards making MemorySSA-backed DSE the default.

TLDR:
MemorySSA DSE can substantially increase the number of stores eliminated in many cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3 -flto, the number of eliminated stores increases by roughly 45%, with some benchmarks seeing much larger improvements). The improvements come at an additional compile-time cost (CTMark geomean -O3 +0.86%, ReleaseThinLTO +1.59%), due to a substantially increased search space. I’d like to propose switching to MemorySSA-backed DSE, as in my opinion the benefits in terms of store reductions outweigh the compile-time impact.

Details:

First, let’s take a look at the benefits:

1. More dead stores eliminated.

The current DSE implementation is mostly limited to eliminating stores within a single BB. MemorySSA-backed DSE can eliminate stores across different basic blocks (a small illustrative example follows the table below). This results in large improvements in some benchmarks. Below are some numbers for MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto:

                                                       Legacy     MSSA
Number of dead stores removed across all benchmarks:   17919    26120   ~ +45.76%
Total number of stores remaining after DSE:           1424860  1409300  ~ -1.1%

Some notable changes (excluding benchmarks with small absolute values of NumFastStores) are listed below. Note that roughly half of the benchmarks have binary changes.

Same hash: 114 (filtered out)
Remaining: 123
Metric: dse.NumFastStores

Program                                      legacy    mssa     diff
test-suite...-typeset/consumer-typeset.test 186.00 1815.00 875.8%
test-suite...lications/sqlite3/sqlite3.test 29.00 167.00 475.9%
test-suite...T2006/445.gobmk/445.gobmk.test 19.00 88.00 363.2%
test-suite.../Applications/SPASS/SPASS.test 49.00 155.00 216.3%
test-suite...lications/ClamAV/clamscan.test 72.00 227.00 215.3%
test-suite.../Benchmarks/nbench/nbench.test 30.00 92.00 206.7%
test-suite...T2000/256.bzip2/256.bzip2.test 31.00 85.00 174.2%
test-suite...chmarks/MallocBench/gs/gs.test 57.00 133.00 133.3%
test-suite...000/255.vortex/255.vortex.test 95.00 200.00 110.5%
test-suite...CFP2006/433.milc/433.milc.test 64.00 134.00 109.4%
test-suite...ications/JM/ldecod/ldecod.test 258.00 514.00 99.2%
test-suite.../Trimaran/enc-pc1/enc-pc1.test 87.00 168.00 93.1%
test-suite...ce/Benchmarks/PAQ8p/paq8p.test 35.00 66.00 88.6%
test-suite...INT2000/164.gzip/164.gzip.test 23.00 43.00 87.0%
test-suite...CFP2000/188.ammp/188.ammp.test 85.00 157.00 84.7%
test-suite...pplications/oggenc/oggenc.test 63.00 110.00 74.6%
test-suite...T2006/456.hmmer/456.hmmer.test 24.00 41.00 70.8%
test-suite.../Benchmarks/Bullet/bullet.test 799.00 1351.00 69.1%
test-suite.../CINT2000/176.gcc/176.gcc.test 412.00 668.00 62.1%
test-suite...nsumer-lame/consumer-lame.test 111.00 175.00 57.7%
test-suite...marks/7zip/7zip-benchmark.test 1069.00 1683.00 57.4%
test-suite...006/447.dealII/447.dealII.test 1715.00 2689.00 56.8%
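
To make the new capability concrete, here is a small, hypothetical C++ example (made up for illustration, not taken from the test-suite or the patches): the first store is dead, but its killing store sits in a different basic block, so the BB-local legacy DSE cannot remove it, while the MemorySSA-backed implementation can (subject to its exploration limits).

  // The store in the entry block is dead: the store after the branch
  // overwrites it on every path, and nothing reads *p in between.
  // The killing store lives in a later basic block, which the
  // BB-local legacy DSE does not look across.
  void f(int *p, int x) {
    *p = 0;    // dead store, removable by MemorySSA-backed DSE
    if (x > 0)
      x = -x;  // no memory access on this path
    *p = x;    // killing store, in a different basic block
  }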

There are a few existing missed optimization issues MemorySSA-backed DSE addresses, e.g.
https://bugs.llvm.org/show_bug.cgi?id=46847
https://bugs.llvm.org/show_bug.cgi?id=40527

And some that should be relatively straightforward to address with the new implementation:
https://bugs.llvm.org/show_bug.cgi?id=16520
https://bugs.llvm.org/show_bug.cgi?id=10292

2. A major new user for MemorySSA means more testing.

MemorySSA-backed DSE makes extensive use of MemorySSA, which increases the test coverage MemorySSA gets. It has already surfaced one issue, which has since been fixed. More users also likely mean more eyes on tuning MemorySSA.
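
For a rough idea of the kind of MemorySSA queries involved, below is a heavily simplified C++ sketch of the upward walk the pass performs. The function names findCandidateDeadDef and mayBeKilledBy are made up; MemorySSA, MemoryDef and the accessors are the real API from llvm/Analysis/MemorySSA.h. The actual logic lives in DeadStoreElimination.cpp and adds alias checks, checks of all paths through MemoryPhis, and exploration limits.

  #include "llvm/Analysis/MemorySSA.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Placeholder for the real "completely overwritten, no intervening
  // read" check; an assumed helper, not an LLVM API.
  static bool mayBeKilledBy(MemoryDef *EarlierDef, StoreInst *KillingStore) {
    (void)EarlierDef; (void)KillingStore;
    return false; // the real pass queries AliasAnalysis here
  }

  // Simplified sketch: starting at a killing store, walk the chain of
  // defining accesses upwards, looking for an earlier MemoryDef whose
  // written location the killing store completely overwrites.
  static MemoryDef *findCandidateDeadDef(MemorySSA &MSSA,
                                         StoreInst *KillingStore) {
    auto *KillingDef = cast<MemoryDef>(MSSA.getMemoryAccess(KillingStore));
    MemoryAccess *Current = KillingDef->getDefiningAccess();
    while (!MSSA.isLiveOnEntryDef(Current)) {
      auto *Def = dyn_cast<MemoryDef>(Current);
      if (!Def)
        return nullptr; // MemoryPhi: all incoming paths would need checking
      if (mayBeKilledBy(Def, KillingStore))
        return Def;     // candidate dead store
      Current = Def->getDefiningAccess(); // keep walking (real pass bounds this)
    }
    return nullptr;
  }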

3. Eliminating some long-standing MemDepAnalysis-related issues.

MemoryDependenceAnalysis, used by the current DSE implementation, makes heavy use of caching and is relatively fragile with respect to cache invalidation. I think there are at least a few long-standing known issues in that area, e.g. PR28014 ([DSE] Infinite loop in DSE with infinite loop).

4. Laying groundwork for further optimizations, like cross-BB memcpy optimization.

The current memcpy optimizer (MemCpyOpt) has the same BB-local limitation as the existing DSE. MemorySSA-backed DSE adds most of the helpers required to port MemCpyOpt to work across basic blocks. Ideally we would do both optimizations as part of the same MemorySSA walk, saving some compile-time.
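
As a hypothetical illustration of the kind of case a cross-BB MemCpyOpt could handle (the function and names are made up, and this assumes the usual aliasing requirements hold; it is not a claim about current behavior):

  #include <cstring>

  struct S { char Buf[64]; };

  // The two memcpys could be folded so the second copies directly
  // from Src, but the branch leaves them in different basic blocks,
  // which today's BB-local MemCpyOpt does not look across.
  void copyThrough(S *Dst, const S *Src, bool Flag) {
    S Tmp;
    std::memcpy(&Tmp, Src, sizeof(S));  // first copy into a local
    if (Flag)
      Flag = false;                     // no memory access in this block
    std::memcpy(Dst, &Tmp, sizeof(S));  // could become memcpy(Dst, Src, ...)
  }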

Now, let’s take a look at the costs:

1. Increase in compile-time

The current DSE implementation relies heavily on caching information about memory dependences, and on the BB-only nature of the analysis to keep compile-time in check. The cross-BB nature of the MemorySSA-backed DSE implementation drastically increases the search space, leading to an increase in compile-time. Here are some numbers for CTMark. Note that the numbers are from a branch with additional compile-time tuning.

Geomean changes (see link for full details)

                             Executed Instrs.  Size (text)
O3                           +0.86%            -0.29%
ReleaseThinLTO               +1.59%            -0.38%
ReleaseLTO-g                 +1.18%            -0.35%
ReleaseThinLTO (link only)   +1.68%            -0.38%
ReleaseLTO-g (link only)     +1.36%            -0.35%

http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=instructions

Note that both the current and the MemorySSA-backed DSE implementations rely on various thresholds to avoid excessive compile-times. There are a few knobs to turn to further reduce the compile-time of the MemorySSA-backed DSE implementation, at the cost of missing some store eliminations. The current settings are chosen so that the compile-time difference is limited without restricting the optimization too much.
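
For reference, these knobs are plain cl::opt thresholds in DeadStoreElimination.cpp. The sketch below shows the style of option involved; the exact option name and default value shown are assumptions, so please check the source for the current set:

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Style of threshold used to bound the MemorySSA walk (name and
  // default assumed; see DeadStoreElimination.cpp for the real ones).
  static cl::opt<unsigned> MemorySSAScanLimit(
      "dse-memoryssa-scanlimit", cl::init(150), cl::Hidden,
      cl::desc("Bound on the number of MemorySSA accesses walked when "
               "searching for a dead store"));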

Another thing to note is that the current DSE implementation has seen compile-time tuning over quite a few years now, while the amount of tuning done for the MemorySSA-backed one is relatively small, so I am optimistic that we will see improvements in compile-time in the future.

I took a look at some of the worst regressions (tramp3d-v4, 7zip): MemorySSA-backed DSE takes roughly 3x more time than the current DSE, taking DSE from the upper half of passes ordered by time into the top 5. It is still noticeably quicker than heavy hitters like GVN and ISel, and roughly on par with the most expensive InstCombine run.

There are also cases where MemorySSA-backed DSE greatly outperforms legacy DSE, e.g. PR37588 (DSE slow with many allocas and calls).
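
The pattern named in that report looks roughly like the sketch below (illustrative only, based on the report's title and not on its actual reproducer, which repeats the pattern at a much larger scale):

  void use(char *);  // opaque external call

  // Many local buffers, each written and then passed to a call; the
  // report says legacy DSE degrades badly as such functions grow.
  void manyAllocasAndCalls() {
    char B0[32] = {0}; use(B0);
    char B1[32] = {0}; use(B1);
    char B2[32] = {0}; use(B2);
    // ... thousands more of the same in the reproducer ...
  }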

Given the above, I think there are a few options:

a) Switch to MemorySSA-backed DSE in all configurations.
b) Switch to MemorySSA-backed DSE, but use more aggressive limits for some configurations (e.g. limit optimizations for -O2/-Os).
c) Switch to MemorySSA-backed DSE in some configurations only (e.g. -O3/-Oz).
d) Do not switch.

I’d like to propose option a), because I think the additional optimizations warrant the increase in compile-time, and I am optimistic that we will see further tuning once the implementation is used heavily and that it will enable additional optimizations.

I think b) would also be an improvement over the current status, but we’d have to balance the limits so we still get substantial improvements over the current DSE.

I don’t think c) is great, as we would have to maintain two versions of DSE, with each of them seeing only half the testing.

d) is also not great, as we are currently failing to optimize a substantial number of dead stores and would miss out on future optimizations made possible by the new implementation.

If there are no objections to switching, I’ll work on getting the remaining compile-time-related patches in ASAP and flip the switch once they land. That should give us enough time to address issues and fine-tune compile-time before the next release.

I’m looking forward to your thoughts!

Cheers,
Florian

Note that both the current and the MemorySSA-backed DSE implementations rely on various thresholds to avoid excessive compile-times. There are a few knobs to turn to further reduce the compile-time of the MemorySSA-backed DSE implementation, at the cost of missing some store eliminations. The current settings are chosen so that the compile-time difference is limited without restricting the optimization too much.

Did you compare the new algorithm with the knobs down enough to take roughly the same compile time as the old one?

b) Switch to MemorySSA-backed DSE, but use more aggressive limits for some configurations (e.g. limit optimizations for -O2/-Os).
c) Switch to MemorySSA-backed DSE in some configurations only (e.g. -O3/-Oz).

If we do switch, and I’m not advocating for that yet, I’d do a mix of those.

Use slightly more aggressive limits at O2, keep the current limits at O3, and perhaps add an extra argument to push even further for people who really need it and can pay the extra compile time.

–renato

Thanks for all the work. The reductions in stores look promising. Do you also have performance numbers showing how much this improves execution time? Did you observe any regressions where MSSA resulted in fewer removed stores?

Michael

Note that both the current and the MemorySSA-backed DSE implementations rely on various thresholds to avoid excessive compile-times. There are a few knobs to turn to further reduce the compile-time of the MemorySSA-backed DSE implementation, at the cost of missing some store eliminations. The current settings are chosen so that the compile-time difference is limited without restricting the optimization too much.

Did you compare the new algorithm with the knobs down enough to take roughly the same compile time as the old one?

Without spending too much time on that, the best I could do while not regressing the total number of stores removed was +0.30% executed instructions for -O3 on CTMark, with ~3% more stores removed across the large benchmark set (executed instructions +0.28% for ReleaseLTO, +0.58% for ReleaseThinLTO).

So MemorySSA DSE can probably get quite close to legacy DSE at legacy DSE’s own game, but intuitively I think the MemorySSA approach is just slightly more expensive in general, because it is not limited to cases that can be cached easily and allows for more flexibility.

Use slightly more aggressive limits at O2, keep the current limits at O3, and perhaps add an extra argument to push even further for people who really need it and can pay the extra compile time.

That is certainly an option, but ideally I think the settings should not get too fragmented.

Cheers,
Florian

I would second the notion to switch to the new implementation. The < 2% compile time cost seems well justified by the results, and removing a major user of MDA - which has had *major* latent correctness issues - seems very worthwhile. We should continue to tune compile time, but we should do that after making the switch.

Philip

Do you also have any numbers on the impact this has on (peak) RSS?

Joerg

I did not gather numbers for execution time yet, but I’ll try to share some tomorrow.

In the current state, for MultiSource/SPEC2000/SPEC2006, there are the following regressions in the number of stores removed:

Program                                      legacy   mssa    diff
test-suite...6/471.omnetpp/471.omnetpp.test  321.00  316.00   -1.6%
test-suite...arks/mafft/pairlocalalign.test   64.00   61.00   -4.7%
test-suite...ks/Prolangs-C++/city/city.test   23.00   21.00   -8.7%
test-suite...oxyApps-C/miniGMG/miniGMG.test   69.00   60.00  -13.0%

I suspect those are caused by the few cases that the MemorySSA version does not yet support. They will need some more investigation, but I think ideally we would not block the switch on them, so we can switch soon and address the issues as they pop up.

Cheers,
Florian

Max RSS looks mostly neutral for CTMark.

O3                           -0.01%
ReleaseThinLTO               +0.10%
ReleaseLTO-g                 +0.00%
O0-g                         +0.00%
ReleaseThinLTO (link only)   -0.23%
ReleaseLTO-g (link only)     -0.09%

http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=max-rss

There appear to be some ±2% swings for ReleaseThinLTO which might be worth a closer look, but overall it does not look too bad, I think.

Cheers,
Florian

Here are some execution time results for ARM64 with -O3 -flto, comparing MemorySSA-DSE against the current DSE implementation on CINT2006 (a negative % means a reduction in execution time with MemorySSA-DSE). Changes within the noise (<= 0.5%) are excluded.

                                             Exec time  Stores removed
test-suite...T2006/456.hmmer/456.hmmer.test  -1.6%      +70.8%
test-suite.../CINT2006/403.gcc/403.gcc.test  -1.4%      +35.7%
test-suite...0.perlbench/400.perlbench.test  -1.2%      +33.2%
test-suite...3.xalancbmk/483.xalancbmk.test  -1.0%      +3.02%
test-suite...T2006/401.bzip2/401.bzip2.test  -0.8%      +70.6%

Hi Florian,

First, thank you for working on this. I’m really glad to see this work so close to being enabled.

I think the numbers look good for run time, and the benefits of switching for all configurations are clear.

For compile time, the current regressions are noticeable, but not a deal breaker in my opinion. I’m very much in favor of switching in all configurations.

To address some of the concerns, it may make sense to lower the threshold somewhat to minimize impact at this time (we won’t have benefits as large at the time of the switch). I’m talking about getting the geomean closer to 1% in all configurations if possible.
I believe that the regressions introduced by this flag flip can be undone by also using MemorySSA in the other passes currently using MemDepAnalysis, thereby offsetting the cost of computing MemorySSA in the first place. The thresholds could be raised again to eliminate more stores once MemCpyOpt+MSSA and NewGVN become the default.

If reducing the thresholds is not possible or removes most of the run time benefits, I would vote for enabling as is.

Best,
Alina

Thank you for the performance numbers. IMHO they justify switching to MSSA-DSE for all configurations, even with slight compile-time regressions.

Michael

Hi,

Thanks for all the responses!

My understanding is that there have been no objections so far against trading a bit of additional compile-time for more eliminated stores. Unless any new concerns are raised, I’ll continue working on submitting the remaining patches to keep compile-time in check and flip the switch once they are in.

Hi Florian,

First, thank you for working on this. I'm really glad to see this work so close to being enabled.

I think the numbers look good for run time, and the benefits of switching for all configurations are clear.

For compile time, the current regressions are noticeable, but not a deal breaker in my opinion. I'm very much in favor of switching in all configurations.

To address some of the concerns, it may make sense to lower the threshold somewhat to minimize impact at this time (we won't have benefits as large at the time of the switch). I'm talking about getting the geomean closer to 1% in all configurations if possible.
I believe that the regressions introduced by this flag flip can be undone by also using MemorySSA in the other passes currently using MemDepAnalysis, thereby offsetting the cost of computing MemorySSA in the first place. The thresholds could be raised again to eliminate more stores once MemCpyOpt+MSSA and NewGVN become the default.

If reducing the thresholds is not possible or removes most of the run time benefits, I would vote for enabling as is.

I’ve been working on some additional changes to bring compile-time down a bit further, while keeping the number of eliminated stores at a similar level. Geomeans:
O3                           +0.62%
ReleaseThinLTO               +1.06%
ReleaseLTO-g                 +0.79%
ReleaseThinLTO (link only)   +0.96%
ReleaseLTO-g (link only)     +0.80%

http://llvm-compile-time-tracker.com/compare.php?from=e19ef1aab524ef10a2d118adcd9f6fd6ca2d7ca9&to=c6812412983b4ebb20f1e32b83893284c8117e7f&stat=instructions

This also includes changes so DSE can re-use the MemorySSA constructed by LICM earlier during LTO. The limits could certainly be a bit tighter still, but I hope the current settings encourage a few additional eyes on DSE + MemorySSA compile-time, which would be great. Also, triggering as often as possible is probably beneficial in terms of flushing out any remaining issues early.

Cheers,
Florian

Hi Florian,

Following up on D86967, I missed that all the timings were using the legacy pass manager.
Did you do any testing on the compile and run time impact for the new pass manager?

Thank you,
Alina

All numbers shared earlier were indeed with the default configuration/LPM.

I did some testing with the new PM today (using ENABLE_EXPERIMENTAL_NEW_PASS_MANAGER). In terms of removed stores, the improvements are very similar to the legacy pass manager.

In terms of compile-time, the impact is unfortunately a bit bigger, due to the fact that we need extra computations of MemorySSA. For example, in the regular pipeline, DSE is scheduled just before LICM, both of which use MemorySSA. With the LPM, if I understand correctly, LICM’s loop pass manager constructs MemorySSA even if there is no loop in a function, so the number of times MemorySSA is computed stays the same.

With the NPM, it seems that LICM’s loop pass manager only constructs MemorySSA if there are actual loops in a function. This is great, but it unfortunately means that using DSE + MemorySSA introduces an extra MemorySSA construction for each function without loops. For tramp3d-v4, for example, this introduces 3x more MemorySSA constructions when running DSE + MemorySSA followed by LICM on the LTO module.

I do not think there are any short-term solutions for this in the NPM, but once more passes, e.g. NewGVN or MemCpyOptimizer, are using MemorySSA, the cost should be somewhat amortized.
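
For context, this is the mechanism by which the cost amortizes in the NPM: a pass requests MemorySSA from the analysis manager and marks it preserved, so the next MemorySSA user gets the cached result instead of recomputing it. A minimal C++ sketch (the pass itself is illustrative, not the actual DSE code; the analysis API is the real one):

  #include "llvm/Analysis/MemorySSA.h"
  #include "llvm/IR/PassManager.h"
  using namespace llvm;

  struct SomeMemorySSAUser : PassInfoMixin<SomeMemorySSAUser> {
    PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM) {
      // Cached if an earlier pass computed and preserved it; otherwise
      // MemorySSA is built here (the extra cost discussed above).
      MemorySSA &MSSA = AM.getResult<MemorySSAAnalysis>(F).getMSSA();
      (void)MSSA; // ... transform, keeping MSSA updated via MemorySSAUpdater ...
      PreservedAnalyses PA;
      PA.preserveSet<CFGAnalyses>();
      PA.preserve<MemorySSAAnalysis>(); // lets the next user skip recomputation
      return PA;
    }
  };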

So here are the latest CTMark numbers with DSE + MemorySSA enabled. Note that those measurements do not include the compile-time improvements made to MemDepAnalysis a week ago (https://github.com/llvm/llvm-project/commit/3a54b6a4b71c21cf3bab4f132cbc2904fb9d997e). I think this highlights the benefit of switching as soon as possible, so the DSE + MemorySSA implementation can benefit from additional fresh eyes focused on compile-time.

New Pass Manager
                             exec instrs.  size (text)
O3                           +0.95%        -0.25%
ReleaseThinLTO               +1.28%        -0.41%
ReleaseLTO-g                 +1.64%        -0.35%
ReleaseThinLTO (link only)   +0.93%        -0.41%
ReleaseLTO-g (link only)     +2.09%        -0.35%

Details:
http://195.201.131.214:8000/compare.php?from=7fa5828950d060a70eb57e721765da7cc67c5695&to=b5e5f6ed0f583ddd7a032c274e013335216d6372&stat=instructions

Legacy Pass Manager
                             exec instrs.  size (text)
O3                           +0.67%        -0.27%
ReleaseThinLTO               +1.07%        -0.42%
ReleaseLTO-g                 +0.81%        -0.33%
ReleaseThinLTO (link only)   +0.94%        -0.42%
ReleaseLTO-g (link only)     +0.78%        -0.33%

Details:
http://llvm-compile-time-tracker.com/compare.php?from=7fa5828950d060a70eb57e721765da7cc67c5695&to=b5e5f6ed0f583ddd7a032c274e013335216d6372&stat=instructions

There are currently 2 patches pending before we should be ready to flip the switch:

https://reviews.llvm.org/D86651 [MemCpyOpt] Preserve MemorySSA.
https://reviews.llvm.org/D86815 [LangRef] Adjust guarantee for llvm.memcpy to also allow equal arguments.

Cheers,
Florian

Just another quick update: the patch to flip the switch is now up as well: https://reviews.llvm.org/D87163

I plan on flipping the switch once the review gets approved, unless there are any remaining major concerns. Please let me know if you see any regressions/problems after the switch.

Cheers,
Florian