Register Pressure Computation during Pre-Allocation Scheduling

Hi,

We are working on a research project whose objective is developing a pre-allocation scheduling algorithm that achieves the optimal balance between exploiting ILP (hiding latencies) and minimizing register pressure. A prototype of our algorithm has been implemented and integrated into an experimental version of LLVM 2.9. Our algorithm is based on a combinatorial optimization approach, which is naturally slower than heuristic approaches. However, our benchmarking (using SPEC CPU2006) shows that for some (but not all) benchmarks our algorithm can produce significantly faster code with a reasonable increase in compile time if we apply it selectively to the hot spots. Work on generating faster code with less increase in compile time is in progress.

One of the issues that we currently trying to resolve is computing a precise register pressure estimate that correlates very well with the amount of spill code generated by the register allocator (we are using the default linear scan one). Of course, we won’t be able to achieve 100% correlation unless we do allocation and scheduling simultaneously, which is not our current goal. Our current goal is limited to enhancing the pre-allocation scheduling phase to achieve the best possible reduction in register pressure without sacrificing ILP. Note that pre-allocation scheduling is done within the basic block while allocation is done globally for the whole function, which makes it even harder to achieve a good correlation.

One factor that is causing our current register pressure estimate to be off is not being able to properly account for live-in and live-out registers (both virtual and physical). As far as we can tell, LLVM represents live-in regs with CopyFromReg instrs and live-out regs with CopyToReg instrs. However, it looks that in a given basic block, LLVM does not generate CopyToReg instrs for registers that are not defined within that block and does not generate CopyFromReg instsr for regs that are not used within the block. This is causing a problem for us, because a precise register pressure estimate has to take into account all live regs at a given point in the block whether they are defined or used in the block itself or not. Our questions are:

(1) How can we get information about the registers that are live within a basic block but are not defined within the block? And how can we get that information for regs that are not used in the block?

(2) Can we safely assume that CopyFromReg instrs and CopyToReg instrs are used for the sole purpose of representing live-in and live-our regs? In other words, are there other uses for these?
If yes, how do we identify the ones that represent live-in and live-out regs.

(3) The current physical regsiter limits obtained using

TargetRegisterInfo::getRegPressureLimit()
seem to be too low (for example, 3 integer regs on x86 32-bit mode). Are these good limits to use in our case? If not, how can we get better limits?

Thank you in advance!
-Ghassan

One factor that is causing our current register pressure estimate to be off is not being able to properly account for live-in and live-out registers (both virtual and physical). As far as we can tell, LLVM represents live-in regs with CopyFromReg instrs and live-out regs with CopyToReg instrs. However, it looks that in a given basic block, LLVM does not generate CopyToReg instrs for registers that are not defined within that block and does not generate CopyFromReg instsr for regs that are not used within the block. This is causing a problem for us, because a precise register pressure estimate has to take into account all live regs at a given point in the block whether they are defined or used in the block itself or not. Our questions are:

(1) How can we get information about the registers that are live within a basic block but are not defined within the block? And how can we get that information for regs that are not used in the block?

This information is only computed immediately before register allocation. Passes that run after scheduling can significantly change the register pressure. In particular MachineCSE and MachineLICM do this.

We also run value-based copy coalescing before register allocation. That usually reduces register pressure.

(2) Can we safely assume that CopyFromReg instrs and CopyToReg instrs are used for the sole purpose of representing live-in and live-our regs? In other words, are there other uses for these?
If yes, how do we identify the ones that represent live-in and live-out regs.

These DAG nodes are also used to copy to/from physical registers before and after calls.

Virtual registers defined by PHI instructions will also appear as CopyFromReg operands.

(3) The current physical regsiter limits obtained using
TargetRegisterInfo::getRegPressureLimit()
seem to be too low (for example, 3 integer regs on x86 32-bit mode). Are these good limits to use in our case? If not, how can we get better limits?

These limits are low exactly to make room for the global registers passing through a block without being used. The x86-32 architecture has so few registers that you pretty much always want to schedule for register pressure.

Note that LLVM 3.0 has a new register allocator with live range splitting. That makes it less important to accurately model the register pressure from global live ranges. These ranges are likely to be split and spilled anyway, so the schedule should not necessarily make room for them.

/jakob

Thank you for the answers, Jakob! That’s really informative for someone who is still new to LLVM like me. Please see my responses below.

-Ghassan

Thank you for the answers, Jakob! That's really informative for someone who
is still new to LLVM like me. Please see my responses below.
-Ghassan

________________________________
From: Jakob Stoklund Olesen <stoklund@2pi.dk>
To: Ghassan Shobaki <ghassan_shobaki@yahoo.com>
Cc: "llvmdev@cs.uiuc.edu" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, August 16, 2011 12:52 AM
Subject: Re: [LLVMdev] Register Pressure Computation during Pre-Allocation
Scheduling

This information is only computed immediately before register allocation.
Passes that run after scheduling can significantly change the register
pressure. In
particular MachineCSE and MachineLICM do this.

Ghassan: I know that phase ordering is a non-trivial problem that does not
have a perfect solution (like most compiler optimization problems!), but I
wonder why LLVM runs such passes between scheduling and allocation. One
would expect a register pressure reduction pass to be placed as close as
possible to the register allocation pass. Ideally, you would like to have an
integrated algorithm that does allocation and scheduling simultaneously, but
such an integrated solution is usually not implemented due to its
complexity. So, my questions are:
(1) CSE naturally tends to increase reg pressure. Is there any particular
reason for doing CSE between scheduling and allocation instead of doing it,
say before scheduling?

There are essentially three relevant phases to code generation here:
adjustments to LLVM IR (which you probably don't care too much about),
instruction selection which translates IR into MachineInstrs, and
transformations on MachineInstrs. The stuff before and after isel is
reasonably flexible; isel itself is quite rigid. The CSE and LICM
passes in question are MachineInstr transformations; pre-RA scheduling
is essentially the last phase of instruction selection.

(3) Does LLVM have command-line options for turning off phases like CSE and
LICM?

Try passing -help-hidden to llc.

-Eli

I know that phase ordering is a non-trivial problem that does not have a perfect solution (like most compiler optimization problems!), but I wonder why LLVM runs such passes between scheduling and allocation. One would expect a register pressure reduction pass to be placed as close as possible to the register allocation pass.

Since SelectionDAG is a graph and MachineInstr is linear, some kind of scheduling needs to take place when converting the intermediate representation.

We are looking at placing a scheduler between coalescing and register allocation. There are complications such as keeping the LiveIntervals data structure updated, and dealing with code that is no longer in SSA form.

Compile time is also a major concern. Out-of-order processors don’t benefit much from ILP scheduling except in special cases, so it is not clear that an extra scheduler is worth the compile time. That depends on the target machine, of course.

Ideally, you would like to have an integrated algorithm that does allocation and scheduling simultaneously, but such an integrated solution is usually not implemented due to its complexity. So, my questions are:

(1) CSE naturally tends to increase reg pressure. Is there any particular reason for doing CSE between scheduling and allocation instead of doing it, say before scheduling?

CSE and LICM are generally performed on the LLVM IR before instruction selection. The late passes exist to exploit opportunities created by instruction selection.

SelectionDAG is a per-block data structure, so these global passes must run on the MachineInstr representation.

(2) How easy will it be to change the phase ordering in LLVM without breaking things? Where is the phase ordering done? How do we know if there are dependencies among certain phases?

You cannot reorder passes that work on different intermediate representations.

These DAG nodes are also used to copy to/from physical registers before and after calls.
Virtual registers defined by PHI instructions will also appear as CopyFromReg operands.

Ghassan: So, is there a way to distinguish the ones that represent live-in and live-out regs?

Look for virtual registers.

/jakob