I would like to restrict LICM in the number of memory operations it hoists out of a loop. Blindly hoisting all instructions out of the loop could result in a lot of spilling/reloading, which is what I’d like to avoid. We see this for example on (inner) loops with small constant bounds that are unrolled. I am drafting something in https://reviews.llvm.org/D92488, which contains a reduced version of my motivating example.
It was brought to my attention that LICM might not have any restrictions by design because hoisting out all instructions could be some sort of canonical form that other passes depend on, so sinking back defs closer to their user(s) might be a back-end problem. I was wondering if there are any opinions on this.
I think the issue is not specific to LICM and can be observed with other transforms like the SCEV expander. This was also brought up before in one of the Loop Optimization WG calls (http://lists.llvm.org/pipermail/llvm-dev/2019-September/135058.html). The general feedback at the time was that it would be better to deal with the issue in the back-end, perhaps through rematerialization of the hoisted instructions in RA using LiveRangeEdit.
I don’t have much in-depth knowledge in this area, but the blocker issue at that time appeared to be that the register allocator could only rematerialize single instructions and we needed groups of instructions to be moved/copied. Does anyone know if any progress is under way in that area?
Bardia Mahjour
Compiler Optimizations
IBM Toronto Software Lab
I think one key benefit of hoisting as much as possible is that it enables/simplifies subsequent optimizations in the middle-end (especially for memory operations). Limiting hoisting is likely to have a bad knock-on effect on other transformations.
Accurately estimating the number of spills in LICM is probably going to be tricky and/or brittle. Another thing to consider is that there are plenty of other transformations that extend live ranges of values, so they would also need updating.
Sinking in the backend would catch those cases naturally, while making our lives easier in the middle-end. We already have plenty of infrastructure to reason about register pressure & co in CodeGen. As Bardia mentioned, currently there is limited support for rematerializing instructions.
Beyond that, I don’t think there are dedicated passes to move defs to uses to reduce register pressure, besides the MachineScheduler. But the MachineScheduler currently only operates on sub-regions of basic blocks.
I think there would be potential for a dedicated pass to move defs across basic blocks to reduce register pressure. I think we have most of the building blocks (RegisterPressureTracker, MachineTraceMetrics for selecting a likely trace through the function). But reasoning about moving memory operations is unfortunately probably going to be more difficult in the backend than in the middle-end.
Many thanks for the feedback! I will investigate MachineLICM first.
FYI, I have documented that LICM is also a canonicalization transform in https://reviews.llvm.org/rG1e260f955d31.
I'd explored the notion of doing register-pressure-aware global code motion a while back. There are a bunch of hard cases, but there's also a bunch of obviously profitable stuff. I think this is entirely realistic if someone wanted to devote time to it.
I doubt this would solve the triggering case for this thread though. Despite the framing as being about LICM, the transform actually involved was store promotion. I suspect the issue is that we simply haven't considered undoing store promotion in MachineLICM. (I haven't looked, so this may be a false statement.)
+1
Assuming the transformation is not irreversible, I would continue to perform normalization steps like LICM and opt (pun intended) for a dedicated, target-specific pass for register pressure minimization.
Years ago I tried to estimate register pressure/usage in the vectorizer, in addition to the heuristic the vectorizer uses, and it turned out that you might get close but won't get it right at that level anyway... at least that is my experience.
~ Johannes
P.S.
I know I'm late and people seem to have already converged on this conclusion anyway.
Note that we have exactly the “focus on easy cases” pass you’re all discussing: LoopSink. It is scheduled late in the optimization pipeline, after canonicalization is complete. It could probably use some extra work, but it was definitely targeting exactly the kinds of cases raised here.
To add to the passes that increase register pressure in loops: GVN, partial redundancy elimination (including LoadPRE), and LoopLoadElimination.
The current pass design seems to rely on rematerialization/reloading to reduce register pressure at a later stage, such as register allocation or the LoopSink pass. The latter has the following code in it:
// Enable LoopSink only when runtime profile is available.
// With static profile, the sinking decision may be sub-optimal.
if (!Preheader->getParent()->hasProfileData())
continue;
I.e., it is not a general-purpose pass to decrease register pressure in loops.