"Machine LICM" for Constants?

Hi All,
     I work on a backend for a target similar to Mips, where large immediates are loaded into registers with 2 instructions, 1 to load the MSBits and 1 to load the LSBits. I've noticed a recurring pattern where, despite low register pressure, these constants will be rematerialized in every iteration of a loop, rather than being hoisted. Here's an example using the mips-unknown-unknown target and Clang/LLVM HEAD. From newlib's implementation of strncat:

#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)
while (!DETECTNULL (*aligned_s1))
   aligned_s1++;

This loop gets lowered under -O3 to:

$BB0_5:
   lui $3, 32896
   lui $7, 65278
   ori $3, $3, 32896 ###### Materialize 0x80808080
   lw $8, 4($2)
   nop
   and $9, $8, $3
   ori $7, $7, 65279 ###### Materialize -(0x01010101)
   addiu $2, $2, 4
   xor $3, $9, $3
   addu $7, $8, $7
   and $3, $3, $7
   beq $3, $zero, $BB0_5

There are a ton of unused caller-saved registers in this small function, so I expected the constant materialization to be hoisted out of the tight loop. I'm still learning about the new register allocator and am not immediately able to make sense of its debug output (and the 'problem' may be elsewhere in any case). I'm happy to post the results of -debug-only regalloc if they're useful.

Is my desire to hoist the constants out of the loop reasonable? Is there something I can do (hints or passes in my backend, clang/opt flag, etc.) to make this happen today? If not, what is the root cause? Maybe there's no way to hoist things out of a loop once IR is lowered into a SelectionDAG?

Thanks,
Matt

Yes machine-licm can and should hoist constant materialization instructions out of the loop. If it's not doing that, it's probably because the target is not modeling the instruction correctly. I would walk through MachineLICM::IsLoopInvariantInst() in the debugger to figure it out. You can also try compiling the same bitcode for a target like ARM or X86 as a comparison.

Evan

Thanks for the tip! I looked into it and it looks like the problem as of SVN HEAD is that the lui and ori instructions in Mips are considered cheap (1-cycle def-use latency) by MachineLICM::IsCheapInstruction(), but are not trivially materializable because their register operands are not always available. This makes MachineLICM::IsProfitableToHoist() return false, preventing the hoist even though MachineLICM::IsLoopInvariantInst() returns true.

The comment in IsProfitableToHoist() is:

// If the instruction is cheap, only hoist if it is re-materilizable [sic]. LICM
// will increase register pressure. It's probably not worth it if the
// instruction is cheap.

The function then proceeds to actually *estimate* register pressure for non-cheap instructions to determine whether or not to hoist them.
This heuristic seems reasonable, but doesn't seem to do the right thing in this case. Hacking the instruction itineraries to make the instructions not seem cheap doesn't seem like the right answer either. I'm guessing the motivation for this heuristic is that, in a loop with many possible hoists, some cheap and some expensive, we would prefer to hoist the expensive ones rather than wasting all our register slack on the cheap ones.

Is there another way to accomplish this goal while still performing the hoist in situations where register pressure is low enough? Say, considering the instructions in a loop for hoisting in descending order of cost, rather than in program order?

Note that ARM gets around this by creating a pseudo-instruction for 32-bit immediate loads (MOVi32imm) , rather than putting a pattern directly in ARMInstrInfo.td. This fused instruction *is* rematerializable (since it defines the entire register), even though either of the two half-register instructions by themselves cannot be. This is one way my target and Mips could hack around the problem, but for my target at least it has the disadvantage of having to add an ExpandPseudo pass to my backend and put logic in C++ that seems (IMO) to belong in TableGen.

-Matt

Thanks for the tip! I looked into it and it looks like the problem as of SVN HEAD is that the lui and ori instructions in Mips are considered cheap (1-cycle def-use latency) by MachineLICM::IsCheapInstruction(), but are not trivially materializable because their register operands are not always available. This makes MachineLICM::IsProfitableToHoist() return false, preventing the hoist even though MachineLICM::IsLoopInvariantInst() returns true.

The comment in IsProfitableToHoist() is:

// If the instruction is cheap, only hoist if it is re-materilizable [sic]. LICM
// will increase register pressure. It's probably not worth it if the
// instruction is cheap.

The function then proceeds to actually *estimate* register pressure for non-cheap instructions to determine whether or not to hoist them.
This heuristic seems reasonable, but doesn't seem to do the right thing in this case. Hacking the instruction itineraries to make the instructions not seem cheap doesn't seem like the right answer either. I'm guessing the motivation for this heuristic is that, in a loop with many possible hoists, some cheap and some expensive, we would prefer to hoist the expensive ones rather than wasting all our register slack on the cheap ones.

Is there another way to accomplish this goal while still performing the hoist in situations where register pressure is low enough? Say, considering the instructions in a loop for hoisting in descending order of cost, rather than in program order?

Note that ARM gets around this by creating a pseudo-instruction for 32-bit immediate loads (MOVi32imm) , rather than putting a pattern directly in ARMInstrInfo.td. This fused instruction *is* rematerializable (since it defines the entire register), even though either of the two half-register instructions by themselves cannot be. This is one way my target and Mips could hack around the problem, but for my target at least it has the disadvantage of having to add an ExpandPseudo pass to my backend and put logic in C++ that seems (IMO) to belong in TableGen.

The pseudo instruction approach is the only quick solution I can think of. MachineLICM is designed to be conservative so I don't think we should change the register pressure heuristics.

Evan