I am in the process of updating our out-of-tree implementation from v5.0 to
v6.0 RC3, and while it builds and mostly runs, I am having trouble with a
small number of tests where the 'WorklistMap' in 'DAGCombiner::Run' never
becomes empty. This is resulting in a runaway state of continuous heap
allocation until the process exhausts all system memory.
But I can't get a handle on why it is doing this, and it is not obvious to
me that the changes between v5.0 and v6.0 RC3 invalidate our implementation
in a way that might cause this. The only time I see our code entered is
when lowering is called for vector element insert by 'LegalizeOp'. Does
anybody have an advice on how I should approach debugging this?
I suspect this is an issue with post-DAG legalization store merging in the DAGCombiner. If you have a custom lowered type the DAGCombiner may end up merging a set of stores and immediately splitting them up in legalization. You should be able to disable this pass universally by overriding mergeStoresAfterLegalization() or conditionally for cases that shouldn’t match with canMergeStoresTo.
You should able able to verify by finding the loop of nodes considered with “-debug” on.
Thanks for this advice Nirav. I have only just returned to working on this after being side-tracked to other tasks.
All the best,
We discovered what is happening.
SDAGCombiner essentially looks at various combinations of nodes to do with vectors, and when it can, it creates a vector shuffle. The problem is, that our vector shuffle lowering builds new trees with vector element, or vector sub-vector insert sequences. The generic DAGCombiner, reconstructs these into a new shuffle, and so the loop continues - we reduce it, and DAGCombiner re-abstracts it.
Our shuffle lowering produces (produced) very optimal code sequences for our target, and has not been changed significantly since LLVM v3.4; but changes between v5.0 and v6.0 have introduced this DAG reduction dependency loop.
Is there any advice to Out-of-Tree implementations about how to re-write their lowering code for shuffle so as to avoid this kind of infinite dependency coupling?
It sounds like you are doing is more akin to shuffle selection than fusion and therefore it’s a better fit for instruction selection than DAGCombining. Try movign it to ISelDAGToDAG’s Select (or potentially PreprocessISelDAG).
We do something very similar in the PPC back end. The ISA has a general vector permute (vperm) that any shuffle can be lowered to. But there are also a bunch of specialized permutes that don’t require materializing a permute control vector. So what we actually do is a custom legalization for shuffles where we will lower shuffles to various PPC-specific ISD nodes, etc.