[PATCH] RegScavenger::scavengeRegister

This patch adds parameter “EliminateFI” to RegScavenger::scavengeRegister, which tells register scavenger not to eliminate frame index of the emergency spill slot if set to false.

I have pseudo load, store and copy instructions which are generated during register allocation and expanded post-RA but before the final stack size is known. I use register scavenger to search for a temporary integer GPR that is used during pseudo-expansion.

This is what happens during pseudo-expansion:

The following pseudo

LoadAC $acc, FI // Pseudo load instructions. Load from FI to accumulator $acc.

is expanded into this sequence:

LW $reg, FI // load from FI to temporary GPR $reg
MTLO $reg // copy $reg to register LO
LW $reg, FI + 4 // load from FI+4 to GPR $reg
MTHI $reg // copy $reg to register HI

rs-eliminatefi1.patch (2.3 KB)

Hi Akira,

The register scavenger is not really supposed to be used in more than one pass, because you would need to allocate a separate emergency spill slot for each pass.

I think Hal is trying to solve a very similar problem for PPC.

Hal, are you expanding pseudos during PEI?

/jakob

Hi Jakob,

I believe Hal is trying to enable register scavenger to find two (or more) registers that can be used as temporaries.

One problem I see with this approach is that, if you use register scavenger during PEI, you will have to pessimistically set aside two emergency spill slots before you call scavengeRegister, even if it turns out you only need one. Having an extra stack slot might not be a big problem, but still it is nice if we can avoid allocating a slot unnecessarily.

I probably won’t need these pseudo instructions that are expanded post-RA in the first place if I can tell the register allocators to spill accumulator registers to general purpose integer registers instead of directly to stack and disallow copying between accumulator registers. But I guess that is a much more difficult problem to solve. Is that right?

That depends.

The register allocator can spill across register classes, but it calls the functionality "live range splitting" and "register class inflation". Here's how you enable it:

- Define a union register class that contains both CPU64Regs and ACRegs.

- Implement TRI::getLargestLegalSuperClass(), and return the new union register class when asked about CPU64Regs or ACRegs (or their sub-classes).

- Teach TII::copyPhysReg() to handle the cross-class copies.

- Teach TII::storeRegToStackSlot() to constrain the register class to CPU64Regs when asked to spill a virtual register from the union register class.

This will use cross-class spilling in most cases, but unfortunately we can't guarantee that an ACRegs virtual register will never be spilled. This just makes it much less likely to happen.

Targets are still required to be able to spill all legal register classes.

Instead of scavenging for registers during pseudo-expansion, I would like to make it possible to create new virtual registers during spilling. The plan is to give TII::storeRegToStackSlot() permission to:

- Insert multiple instructions at the provided iterator, and

- Create new virtual registers, possibly from different register classes.

I think that functionality would solve your problems, right?

The general idea is that the scavenger should only be used when it is not possible to determine at RA time if a register is needed. That would typically be because the frame layout is not known yet. If a register is always needed, RA should pick it. It is going to do better than the scavenger.

Can you use Hal's scavenger tricks until we get this functionality added to the register allocators? (Help implementing it is always welcome, of course).

/jakob

From: "Jakob Stoklund Olesen" <stoklund@2pi.dk>
To: "Akira Hatanaka" <ahatanak@gmail.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Monday, March 25, 2013 3:06:16 PM
Subject: Re: [LLVMdev] [PATCH] RegScavenger::scavengeRegister

> This patch adds parameter "EliminateFI" to
> RegScavenger::scavengeRegister, which tells register scavenger not
> to eliminate frame index of the emergency spill slot if set to
> false.
>
> I have pseudo load, store and copy instructions which are generated
> during register allocation and expanded post-RA but before the
> final stack size is known. I use register scavenger to search for
> a temporary integer GPR that is used during pseudo-expansion.
>
> This is what happens during pseudo-expansion:
>
> The following pseudo
>
> LoadAC $acc, FI // Pseudo load instructions. Load from FI to
> accumulator $acc.
>
>
> is expanded into this sequence:
>
> LW $reg, FI // load from FI to temporary GPR $reg
> MTLO $reg // copy $reg to register LO
> LW $reg, FI + 4 // load from FI+4 to GPR $reg
> MTHI $reg // copy $reg to register HI

Hi Akira,

The register scavenger is not really supposed to be used in more than
one pass, because you would need to allocate a separate emergency
spill slot for each pass.

I think Hal is trying to solve a very similar problem for PPC.

Hal, are you expanding pseudos during PEI?

In part, yes. It is for eliminating FIs and expanding some pseudos for spilling registers that don't have direct load/store instructions.

-Hal

From: "Akira Hatanaka" <ahatanak@gmail.com>
To: "Jakob Stoklund Olesen" <stoklund@2pi.dk>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Monday, March 25, 2013 3:41:21 PM
Subject: Re: [LLVMdev] [PATCH] RegScavenger::scavengeRegister

Hi Jakob,

I believe Hal is trying to enable register scavenger to find two (or
more) registers that can be used as temporaries.

One problem I see with this approach is that, if you use register
scavenger during PEI, you will have to pessimistically set aside two
emergency spill slots before you call scavengeRegister, even if it
turns out you only need one. Having an extra stack slot might not be
a big problem, but still it is nice if we can avoid allocating a
slot unnecessarily.

True; in general, this only affects us on PPC for functions with large stack frames, and even then only under the rare circumstance where we need to spill condition registers, so adding an extra stack slot does not seem like a big deal. Nevertheless, I view this as a temporary measure until Jakob's proposal to allow virtual registers to be created during spilling can be implemented.

-Hal

Hi Jakob,

I believe Hal is trying to enable register scavenger to find two (or more) registers that can be used as temporaries.

One problem I see with this approach is that, if you use register scavenger during PEI, you will have to pessimistically set aside two emergency spill slots before you call scavengeRegister, even if it turns out you only need one. Having an extra stack slot might not be a big problem, but still it is nice if we can avoid allocating a slot unnecessarily.

I probably won’t need these pseudo instructions that are expanded post-RA in the first place if I can tell the register allocators to spill accumulator registers to general purpose integer registers instead of directly to stack and disallow copying between accumulator registers. But I guess that is a much more difficult problem to solve. Is that right?

That depends.

The register allocator can spill across register classes, but it calls the functionality “live range splitting” and “register class inflation”. Here’s how you enable it:

  • Define a union register class that contains both CPU64Regs and ACRegs.

  • Implement TRI::getLargestLegalSuperClass(), and return the new union register class when asked about CPU64Regs or ACRegs (or their sub-classes).

  • Teach TII::copyPhysReg() to handle the cross-class copies.

  • Teach TII::storeRegToStackSlot() to constrain the register class to CPU64Regs when asked to spill a virtual register from the union register class.

This will use cross-class spilling in most cases, but unfortunately we can’t guarantee that an ACRegs virtual register will never be spilled. This just makes it much less likely to happen.

I will look into this. It will probably alleviate the problem.

Targets are still required to be able to spill all legal register classes.

Instead of scavenging for registers during pseudo-expansion, I would like to make it possible to create new virtual registers during spilling. The plan is to give TII::storeRegToStackSlot() permission to:

  • Insert multiple instructions at the provided iterator, and

  • Create new virtual registers, possibly from different register classes.

I think that functionality would solve your problems, right?

Yes, it sounds like it will solve the problem.

Using the following example where live ranges of accumulators $vreg_acc0 and $vreg_acc1 conflict,

MULT $vreg_acc0, $vreg_gpr0, $vreg_gpr1
MULT $vreg_acc1, $vreg_gpr2, $vreg_gpr3

(consumer of $vreg_acc1)
(consumer of $vreg_acc0)

if the register can create new virtual registers $vreg_gpr4 and $vreg_gpr5, I think spilling can be avoided:

MULT $vreg_acc0, $vreg_gpr0, $vreg_gpr1
copy $vreg_gpr4, $vreg_acc0:lo // spill lo
copy $vreg_gpr5, $vreg_acc0:hi // spill hi
MULT $vreg_acc1, $vreg_gpr2, $vreg_gpr3

(consumer of $vreg_acc1)
copy $vreg_acc0:lo, $vreg_gpr4 // restore lo
copy $vreg_acc0:hi, $vreg_gpr5 // restore hi
(consumer of $vreg_acc0)

Also, should RA avoid splitting live intervals of accumulators, which creates copy instructions?

The general idea is that the scavenger should only be used when it is not possible to determine at RA time if a register is needed. That would typically be because the frame layout is not known yet. If a register is always needed, RA should pick it. It is going to do better than the scavenger.

Can you use Hal’s scavenger tricks until we get this functionality added to the register allocators? (Help implementing it is always welcome, of course).

Yes, I think I can, but I have to understand details of Hal’s patch first.

Yes, it sounds like it will solve the problem.

Using the following example where live ranges of accumulators $vreg_acc0 and $vreg_acc1 conflict,

MULT $vreg_acc0, $vreg_gpr0, $vreg_gpr1
MULT $vreg_acc1, $vreg_gpr2, $vreg_gpr3

(consumer of $vreg_acc1)
(consumer of $vreg_acc0)

if the register can create new virtual registers $vreg_gpr4 and $vreg_gpr5, I think spilling can be avoided:

MULT $vreg_acc0, $vreg_gpr0, $vreg_gpr1
copy $vreg_gpr4, $vreg_acc0:lo // spill lo
copy $vreg_gpr5, $vreg_acc0:hi // spill hi
MULT $vreg_acc1, $vreg_gpr2, $vreg_gpr3

(consumer of $vreg_acc1)
copy $vreg_acc0:lo, $vreg_gpr4 // restore lo
copy $vreg_acc0:hi, $vreg_gpr5 // restore hi
(consumer of $vreg_acc0)

The cross class spilling doesn't support spilling to multiple registers, though. I thought you could copy the accumulator to a single 64-bit register.

Also, should RA avoid splitting live intervals of accumulators, which creates copy instructions?

The alternative to live range splitting is spilling, which is usually worse.

/jakob

Yes, it sounds like it will solve the problem.

Using the following example where live ranges of accumulators $vreg_acc0 and $vreg_acc1 conflict,

MULT $vreg_acc0, $vreg_gpr0, $vreg_gpr1
MULT $vreg_acc1, $vreg_gpr2, $vreg_gpr3

(consumer of $vreg_acc1)
(consumer of $vreg_acc0)

if the register can create new virtual registers $vreg_gpr4 and $vreg_gpr5, I think spilling can be avoided:

MULT $vreg_acc0, $vreg_gpr0, $vreg_gpr1
copy $vreg_gpr4, $vreg_acc0:lo // spill lo
copy $vreg_gpr5, $vreg_acc0:hi // spill hi
MULT $vreg_acc1, $vreg_gpr2, $vreg_gpr3

(consumer of $vreg_acc1)
copy $vreg_acc0:lo, $vreg_gpr4 // restore lo
copy $vreg_acc0:hi, $vreg_gpr5 // restore hi
(consumer of $vreg_acc0)

The cross class spilling doesn’t support spilling to multiple registers, though. I thought you could copy the accumulator to a single 64-bit register.

The size of general purpose integer registers for mips32 is 32-bit and accumulators are 64-bit registers consisting of 32-bit hi/lo register pairs. So you will need two instructions to copy two 32-bit GPR registers to a 64-bit accumulator register. If spilling to multiple registers is unsupported, perhaps I can I define a new register class consisting of paired GPR registers and pseudo copy instructions?

Also, should RA avoid splitting live intervals of accumulators, which creates copy instructions?

The alternative to live range splitting is spilling, which is usually worse.

Here I was assuming register allocator will spill accumulator registers to integer registers instead of directly to stack. In that case, splitting might be worse than spilling since reload requires two GPR-to-accumulator copy instructions while copying one accumulator to another requires four copy instructions (instruction set doesn’t have any accumulator-to-accumulator copy instructions):

copy $vreg_gpr0, $vreg_acc0:lo
copy $vreg_gpr1, $vreg_acc0:hi
copy $vreg_acc1:lo, $vreg_gpr0
copy $vreg_acc1:hi, $vreg_gpr1

The size of general purpose integer registers for mips32 is 32-bit and accumulators are 64-bit registers consisting of 32-bit hi/lo register pairs. So you will need two instructions to copy two 32-bit GPR registers to a 64-bit accumulator register. If spilling to multiple registers is unsupported, perhaps I can I define a new register class consisting of paired GPR registers and pseudo copy instructions?

Yes, that would work. Look at the ARM GPRPair register class.

You don't need pseudo copy instructions, TII::copyPhysReg() is allowed to insert multiple instructions.

> Also, should RA avoid splitting live intervals of accumulators, which creates copy instructions?

The alternative to live range splitting is spilling, which is usually worse.

Here I was assuming register allocator will spill accumulator registers to integer registers instead of directly to stack. In that case, splitting might be worse than spilling since reload requires two GPR-to-accumulator copy instructions while copying one accumulator to another requires four copy instructions (instruction set doesn't have any accumulator-to-accumulator copy instructions):

copy $vreg_gpr0, $vreg_acc0:lo
copy $vreg_gpr1, $vreg_acc0:hi
copy $vreg_acc1:lo, $vreg_gpr0
copy $vreg_acc1:hi, $vreg_gpr1

There is no way of preventing copies completely.

Once you create a virtual register with some register class, you must support spilling, reloading, and copying of that register class.

The cross class spilling feature can make these things less frequent, but it can't make them go away completely.

/jakob

From: "Jakob Stoklund Olesen" <stoklund@2pi.dk>
To: "Akira Hatanaka" <ahatanak@gmail.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Tuesday, March 26, 2013 12:40:44 PM
Subject: Re: [LLVMdev] [PATCH] RegScavenger::scavengeRegister

> The size of general purpose integer registers for mips32 is 32-bit
> and accumulators are 64-bit registers consisting of 32-bit hi/lo
> register pairs. So you will need two instructions to copy two
> 32-bit GPR registers to a 64-bit accumulator register. If spilling
> to multiple registers is unsupported, perhaps I can I define a new
> register class consisting of paired GPR registers and pseudo copy
> instructions?

Yes, that would work. Look at the ARM GPRPair register class.

In the PPC case, for CR spilling, the CRs are only really 4 bits; I'd need to define 8-tuples just to pair with 32-bit GPR registers. Under normal circumstances, i32 is the smallest legal type. Instead, to make this work, would I define a 4-bit subregister of the GPRs to hold the data (or maybe 8 bits, and then also use some CR-pair register class)?

Thanks again,
Hal

No, that shouldn't be necessary.

The types on register classes are only used by isel, and this register class wouldn't be used before register allocation. Just make it [i32], or even [untyped]. (The type is used to pick a default spill size, so you may need to 'let Size = 4' if you go with untyped).

Your implementation in copyPhysReg is the final word on what it means to copy between registers in this class.

The register class will not be used automatically without permission from your implementation of getLargestLegalSuperClass. This function should not allow normal GPR registers to be inflated to the GPR+CR super-class because not all registers in that class have enough bits.

X86RegisterInfo::getLargestLegalSuperClass() does something similar with the GR8_NOREX register class to work around some awkward x86 encoding issues with the 20 8-bit registers.

/jakob

From: "Jakob Stoklund Olesen" <stoklund@2pi.dk>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Akira Hatanaka" <ahatanak@gmail.com>
Sent: Saturday, April 6, 2013 11:56:28 AM
Subject: Re: [LLVMdev] [PATCH] RegScavenger::scavengeRegister

>> From: "Jakob Stoklund Olesen" <stoklund@2pi.dk>
>> To: "Akira Hatanaka" <ahatanak@gmail.com>
>> Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Hal
>> Finkel" <hfinkel@anl.gov>
>> Sent: Tuesday, March 26, 2013 12:40:44 PM
>> Subject: Re: [LLVMdev] [PATCH] RegScavenger::scavengeRegister
>>
>>
>>
>>> The size of general purpose integer registers for mips32 is
>>> 32-bit
>>> and accumulators are 64-bit registers consisting of 32-bit hi/lo
>>> register pairs. So you will need two instructions to copy two
>>> 32-bit GPR registers to a 64-bit accumulator register. If
>>> spilling
>>> to multiple registers is unsupported, perhaps I can I define a
>>> new
>>> register class consisting of paired GPR registers and pseudo copy
>>> instructions?
>>
>> Yes, that would work. Look at the ARM GPRPair register class.
>
> In the PPC case, for CR spilling, the CRs are only really 4 bits;
> I'd need to define 8-tuples just to pair with 32-bit GPR
> registers. Under normal circumstances, i32 is the smallest legal
> type. Instead, to make this work, would I define a 4-bit
> subregister of the GPRs to hold the data (or maybe 8 bits, and
> then also use some CR-pair register class)?

No, that shouldn't be necessary.

The types on register classes are only used by isel, and this
register class wouldn't be used before register allocation. Just
make it [i32], or even [untyped]. (The type is used to pick a
default spill size, so you may need to 'let Size = 4' if you go with
untyped).

Your implementation in copyPhysReg is the final word on what it means
to copy between registers in this class.

The register class will not be used automatically without permission
from your implementation of getLargestLegalSuperClass. This function
should not allow normal GPR registers to be inflated to the GPR+CR
super-class because not all registers in that class have enough
bits.

X86RegisterInfo::getLargestLegalSuperClass() does something similar
with the GR8_NOREX register class to work around some awkward x86
encoding issues with the 20 8-bit registers.

Okay, thanks! So when the RA decides to use this register inflation mechanism, it decides on both the inflation point and the deflation point at the same time? I'd like to understand this better, can you please provide a pointer to the logic for this?

-Hal

When the register allocator runs out of registers, it tries splitting the live range it is currently looking at (RAGreedy::selectOrSplit). If splitting was successful, there may be an opportunity to choose a larger register class for some of the parts. The original register class was selected as the intersection of the constraints imposed by all the users of the virtual register. The new live ranges each have fewer users, so they may be less constrained.

The splitter calls LiveRangeEdit::calculateRegClassAndHint() which calls MachineRegisterInfo::recomputeRegClass() for each of the new virtual registers created by splitting an old one.

The new register class is computed by starting from getLargestLegalSuperClass(OldRC) and then applying the constraints from each use in turn.

The register allocator is more aggressive about splitting live ranges when it sees a register class that is inflatable. See the checks for isProperSubClass() here and there. In particular, it may split around a single instruction, and that is how the cross-class spilling works.

Suppose:

  %vreg0 = CMP ...; %vreg0:CRRC
...
  BRANCH %vreg0; %vreg0:CRRC

Live range splitting can split around the individual instructions to produce:

  %vreg1 = CMP ...; %vreg1:CRRC
  %vreg2 = COPY %vreg1
...
  %vreg3 = COPY %vreg2; %vreg3:CRRC
  BRANCH %vreg3; %vreg3:CRRC
  
The tiny live ranges %vreg1 and %vreg3 are constrained to CRRC by their users, but %vreg2 is only used by COPY instructions which don't impose any constraints.

This means the %vreg2 gets whatever register class is returned by getLargestLegalSuperClass(CRRC). If that class contains the GPRs there is a good chance we will be able to find a register.

Now, the same thing can happen when splitting a GPRC register, and so it is important that getLargestLegalSuperClass(GPRC) only returns register classes with all 32/64 bits.

/jakob

Is there a way to force or increase the likelihood of rematerialization of condition registers, instead of copying or spilling?

mips isa doesn’t have a handy instruction which enables reading or writing condition registers in one or two cycles (you need several shift and masking operations), so it makes more sense to rematerialize, if that’s possible.

Cheap-as-a-copy remat is always preferred to inserting a COPY instruction, but only if the operands to the rematted instructions are already available at the new location.

RA won't extend live ranges for a remat, but I can see how that would be useful in your case.

It gets complicated if the virtual registers used have multiple values - you would need to prove that the same value is available at the new location.

LiveRangeEdit has the remat code.

/jakob