Back end loop invariant opt

Hi All,

It seems to me that there is potential for doing some target independent loop invariant optimizations in the back end, but prior to instruction selection. For instance, I noticed that the loop invariant constant generated for a loop bounds check is still stuck inside the loop. Likewise, the constant addresses of globals accessed within a loop are regenerated on each iteration, even though they are invariant.
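
As a made-up illustration of the kind of loop I mean (the exact code generated of course depends on the target):

    // Hypothetical example: the constant 100000 used by the loop bounds
    // check and the address of the global 'counter' are both loop
    // invariant, yet the instructions that materialize them are emitted
    // inside the loop on every iteration.
    int counter;

    void count(const int *data) {
      for (int i = 0; i != 100000; ++i)   // constant re-materialized each iteration
        counter += data[i];               // global address re-formed each iteration
    }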

I’m most familiar with LLVM’s target interface and code gen and I’m not fully up to speed on the optimization framework. Is there an existing optimization pass that handles these cases that I’ve failed to properly enable or hook into? Is there something that inherently prevents these from being target independent optimizations, requiring a target specific opt pass? Is this simply an optimization pass that hasn’t been implemented yet, but is a good idea?

Thanks

It seems to me that there is potential for doing some target independent loop invariant optimizations in the back end, but prior to instruction selection.

I assume you mean after instruction selection?

For instance, I noticed that the loop invariant constant generated for a loop bounds check is still stuck inside the loop. Likewise, the constant addresses of globals accessed within a loop are regenerated on each iteration, even though they are invariant.

Yes, this is a known issue. The plan is to implement a machine-code level pass that hoists machine instrs out of loops after instruction selection has been performed. This is important, because isel is the pass that exposes most of the issues that need hoisting (e.g. load immediate instructions, etc).
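
Roughly, the effect on a loop like the one above (sketched at the source level for clarity; the pass itself would work on the selected machine instrs, not on C++) would be as if it had been written:

    // Source-level sketch only: hoisting moves the invariant
    // materializations into the loop preheader so they execute once
    // instead of once per iteration.
    int counter;

    void count(const int *data) {
      const int limit = 100000;   // "load immediate" hoisted out of the loop
      int *cp = &counter;         // global address formed once
      for (int i = 0; i != limit; ++i)
        *cp += data[i];
    }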

I'm most familiar with LLVM's target interface and code gen and I'm not fully up to speed on the optimization framework. Is there an existing optimization pass that handles these cases that I've failed to properly enable or hook into? Is there something that inherently prevents these from being target independent optimizations, requiring a target specific opt pass? Is this simply an optimization pass that hasn't been implemented yet, but is a good idea?

The latter. On targets with few registers, LICM can be very bad. Our plan is to implement rematerialization in the register allocator to counteract this. In any case, I think that LICM should definitely be done, and targets can choose to use it or not based on their register file until remat is done.

-Chris

It seems to me that there is potential for doing some target independent loop
invariant optimizations in the back end, but prior to instruction selection.

I assume you mean after instruction selection?

I assumed that to be target independent the pass would need to run before instruction selection, but I guess if the pass works more like the basic block/branch analysis, that would make sense.

For instance, I noticed that the loop invariant constant generated for a loop
bounds check is still stuck inside the loop. Likewise, the constant addresses
of globals accessed within a loop are regenerated on each iteration,
even though they are invariant.

Yes, this is a known issue. The plan is to implement a machine-code level
pass that hoists machine instrs out of loops after instruction selection
has been performed. This is important, because isel is the pass that
exposes most of the issues that need hoisting (e.g. load immediate
instructions, etc).

Isn’t it possible to tell that certain DAG nodes are loop invariant prior to instruction selection, such as a constant or global address node?

I’m most familiar with LLVM’s target interface and code gen and I’m not fully
up to speed on the optimization framework. Is there an existing optimization
pass that handles these cases that I’ve failed to properly enable or hook
into? Is there something that inherently prevents these from being target
independent optimizations, requiring a target specific opt pass? Is this
simply an optimization pass that hasn’t been implemented yet, but is a good
idea?

The latter. On targets with few registers, LICM can be very bad. Our plan
is to implement rematerialization in the register allocator to counteract
this. In any case, I think that LICM should definitely be done, and
targets can choose to use it or not based on their register file until
remat is done.

Is there more information (PR?) on the schedule for this work and who is involved? I may not be experienced enough to make much of a contribution yet, but I’d like to keep tabs on it.

Only in part. Selection of some nodes generates duplicate code in some backends and not others. A global that requires two instructions to load, as on Alpha (where one of those instructions may sometimes be redundant with other loads of globals, but not always), may benefit from hoisting, whereas on other architectures hoisting the load out may increase register pressure with little benefit (or the addressing is trivially folded into the load, etc). The current thinking is that a LICM- and CSE-type pass needs to operate on machine code (in a target independent way) so that the particularities of each architecture are exposed (by virtue of having the machine code selected). There is also no way at the DAG stage to remove duplicate instructions that are introduced as an artifact of the architecture; DAG nodes are not a 1:1 mapping onto selected instructions.

Andrew

I assume you mean after instruction selection?

I assumed that to be target independent the pass would need to run before instruction selection, but I guess if the pass works more like the basic block/branch analysis, that would make sense.

Right. We already have LICM that runs before isel. The problem is that isel exposes new opportunities for LICM.

Yes, this is a known issue. The plan is to implement a machine-code level
pass that hoists machine instrs out of loops after instruction selection
has been performed. This is important, because isel is the pass that
exposes most of the issues that need hoisting (e.g. load immediate
instructions, etc).

Isn't it possible to tell that certain DAG nodes are loop invariant prior to instruction selection, such as a constant or global address node?

Yes. The problem is that selectiondags are currently only built per-basic block, so you can't do whole-function code motion xforms on them. In the long term, this is certainly something I want to fix. See Cliff Click's thesis for some of the things you can do with whole-function dags.

targets can choose to use it or not based on their register file until
remat is done.

Is there more information (PR?) on the schedule for this work and who is involved? I may not be experienced enough to make much of a contribution yet, but I'd like to keep tabs on it.

We have very simple remat implemented, basically for instructions that have no input registers and define a single output register. We don't have any immediate short-term plans to implement either machine code LICM or the rest of remat.
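
In rough pseudo-C++ (names here are made up for illustration, not the actual allocator interfaces), the triviality test amounts to something like:

    // Illustrative sketch only -- not the real register allocator code.
    // An instruction is trivially rematerializable if it reads no
    // registers, defines exactly one register, and has no side effects
    // (e.g. a load-immediate), so the allocator can recompute the value
    // at a use instead of spilling and reloading it.
    struct InstrSummary {
      unsigned NumRegUses;
      unsigned NumRegDefs;
      bool HasSideEffects;
    };

    static bool isTriviallyRematerializable(const InstrSummary &I) {
      return I.NumRegUses == 0 && I.NumRegDefs == 1 && !I.HasSideEffects;
    }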

In the medium term (6-18 months), I'd like to start a major redesign project for the register allocator. The goal would be to build a framework to handle standard techniques like shrink wrapping, live range splitting, remat, ra resched, etc.

In the shorter term, there are still lots of other things that can be improved; for now we are focusing on those.

-Chris

Andrew is right, but after the isel pass runs, there *should* be a 1-1 mapping between target dag nodes and machine instrs. I say "should", because we don't always have that at this point, but it is a goal going forward.

-Chris