LoopAccessAnalysis cache problem with new PM

Hi!

Spent lots of time today trying to debug a problem where I end up with sporadic crashes, asserts, etc.
It happens occasionally if I run the same opt test case over and over again, and I see the failures when LoopDistribution is running.

In the IR there are 2 functions, "f" and "g". Function "f" has several loops, and function "g" has one loop.

My command line is basically "opt -O3 -enable-loop-distribute -debug-pass-manager=verbose -debug-only=loop-distribute,loop-accesses", but it seems to fail more often (or more seldom) if adding some other parameters as well.

I've added some debug printouts in LoopDistributeForLoop::processLoop to print the Loop* L variable, and in LoopBase c'tor/d'tor, as that seemed to be relevant.

What I've found out is that when it goes bad the output is like this:

......................
<cut>
Running pass: LoopSimplifyPass on f
Running analysis: LoopAnalysis on f
*ADDED-DEBUG* Loop is created: 0x891cc60

<cut>
Running pass: LoopDistributePass on f

LDist: In "f" checking Loop at depth 3 containing: %for.body10.us.us<header><latch><exiting>
*ADDED-DEBUG* L used when requesting LAA: 0x891cd08
Running analysis: LoopAccessAnalysis on Loop at depth 3 containing: %for.body10.us.us<header><latch><exiting>

LDist: In "f" checking Loop at depth 3 containing: %for.body89<header><latch><exiting>
*ADDED-DEBUG* L used when requesting LAA: 0x891cc60
Running analysis: LoopAccessAnalysis on Loop at depth 3 containing: %for.body89<header><latch><exiting>

<cut>
Running pass: SimpleLoopUnswitchPass on Loop at depth 2 containing: %for.cond1.preheader<header><exiting>,%for.body10.us.us,%cleanup.thread.split.us.us,%for.body89,%crit_edge,%for.inc100<latch><exiting>,%for.body89.preheader,%for.body10.us.us.preheader
*ADDED-DEBUG* Loop is deleted: 0x891cc60

<cut>
Running pass: LoopSimplifyPass on g
Running analysis: LoopAnalysis on g
*ADDED-DEBUG* Loop is created: 0x891cc60

<cut>
Running pass: LoopDistributePass on g

LDist: In "g" checking Loop at depth 1 containing: %for.cond<header><latch><exiting>
*ADDED-DEBUG* L used when requesting LAA: 0x891cc60

<cut>
......................

And then the pass manager will return the LoopAccessInfo cached for the Loop in function "f".
(Notice that the Loop object happened to reuse the same address for the Loop object used earlier.)

I don't know if the using object pointers as key to the cached analysis like this is safe in general.
But I guess the problem here is that the cache should have been cleared earlier (when the original Loop was deleted), right?

Trying to track when LoopBase objects are created and deleted kind of points at that the first Loop object at 0x891cc60 is deleted when doing loop unswitch on a loop at lower depth in function "f". But I'm not sure exactly how the LoopAnalysisManagerFunctionProxy holding the cached LoopAccessAnalysis based on that pointer is supposed to be informed about that. But I suspect that maybe SimpleLoopUnswitchPass is to blame here? (or something in the pass manager framework)

I guess I'll need to write a PR for this, but kind of messy since getting the crashes is kind of sporadic. So I might need to find a better way to detect that there is a stale analysis (but a few hours ago I did not know much about how these analyses are cached in the new PM at all, so I haven't really known what to look for).

If anyone got some ideas how to debug this further, or what to look for, I'm interested to get some guidance how to continue debugging this (or I'll just have to keep banging my head on the wall trying to understand analysis caches and proxies in new PM tomorrow).

Regards,
Björn

A bug and a repro would be good, even if it’s not consistent. I can take a look, having worked with new PM caches/proxies a bit.
llvm-reduce’ing the IR with a test that runs the pipeline multiple times would be nice.

Perhaps wherever the loop is being deleted, it’s not calling LPMUpdater::markLoopAsDeleted()? LPMUpdater::markLoopAsDeleted() deletes analysis cache entries for the deleted loop.

A quick look at SimpleLoopUnswitch tells me that it is doing some markLoopAsDeleted calls for the base loop.

But when unswitching it may clone/destroy other sub loops (?) as well (there are some calls to LI.destroy(…) but I can’t really see that there is any LPMUpdater calls related to those).

I’ll see if I can reduce and find a better reproducer tomorrow (given the knowledge that I probably should be able to run LoopDistribute → SimpleLoopUnswitch and then somehow detect that there should is a stale analysis it might be a lot easier compared to my earlier attempts).

/Björn

An attempt to solve the problem I was discussing here: https://reviews.llvm.org/D109257

Kind of nasty bug (at least to debug). Maybe this should be merged to LLVM 13 if the patch is ok?

/Björn