TargetRegisterInfo and "infinite" register files

Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes. I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does not have a concrete, pre-defined register file. The architecture is PTX, which is more of an intermediate form than a final assembly language. The format is essentially three-address code, with “virtual” registers instead of “physical” registers. After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints). However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new “physical” register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible.

Generally stated, I would like to solve the register allocation problem as “allocate the minimum number of registers from an arbitrary set without spill code” instead of the more traditional “allocate the minimum number of registers from a fixed set.”

The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen. This works, but is not scalable. If the register allocator runs out of registers, spill code must be generated. However, the “optimal” solution in this case would be to extend the register file. A few alternatives I have come up with are:

  1. Bypass register allocation completely and just emit virtual registers,
  2. Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or
  3. Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed
I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code. Option (2) seems promising, though I worry about the feasibility of the approach. Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture.

Any thoughts from the experts out there? Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered.

Thanks!

Justin,

We have the same issue with the AMDIL code generator. We tried #1, but there are passes after the register allocator that don’t like virtual registers. #3 could be done by having the two spill functions [load|store]Reg[From|To]StackSlot keep track of the FrameIndex-to-register mapping internally, but again, that is more of a hack than a proper solution.

My solution was to just create a very large register file, 768 registers, that no sane kernel would ever reach, and then do register allocation within that. All that is needed is a simple script, run at build time, that generates the tables into a separate .td file, which is then included in the necessary locations so it doesn’t bloat the code.
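The generator script Micah describes might look roughly like the following. This is only an illustrative sketch: the register prefix, the "RegClass32"/"PTX" names, and the output filename are placeholders, not taken from the actual AMDIL backend.

```python
# Illustrative sketch of a build-time generator for a large register .td
# file. All names here (R0..Rn, RegClass32, the "PTX" namespace) are
# hypothetical; only the 768-register count comes from the discussion.
NUM_REGS = 768  # large enough that "no sane kernel would ever reach" it

def emit_reg_defs(num_regs, prefix="R"):
    """Return TableGen text: one Register def per register, plus one
    RegisterClass that contains them all."""
    lines = [f'def {prefix}{i} : Register<"{prefix.lower()}{i}">;'
             for i in range(num_regs)]
    lines.append(f'def RegClass32 : RegisterClass<"PTX", [i32], 32, '
                 f'(sequence "{prefix}%u", 0, {num_regs - 1})>;')
    return "\n".join(lines)

# In practice the output would be written to a separate .td file and
# included from the hand-written register info file.
print(emit_reg_defs(4).splitlines()[0])  # → def R0 : Register<"r0">;
```

Running this once at build time keeps the hand-written .td files small, which is the "doesn't bloat the code" point above.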

Micah

Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes. I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does *not* have a concrete, pre-defined register file. The architecture is PTX, which is more of an intermediate form than a final assembly language. The format is essentially three-address code, with "virtual" registers instead of "physical" registers. After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints). However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new "physical" register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible.

What kind of optimizations can ptxas do? Can it hoist computations out of loops? Can it split live ranges? Can it coalesce live ranges?

Generally stated, I would like to solve the register allocation problem as "allocate the minimum number of registers from an arbitrary set without spill code" instead of the more traditional "allocate the minimum number of registers from a fixed set."

It's a common misconception, but that is not what LLVM's register allocators do. They try to minimize the amount of executed spill code given the fixed set of registers.

I wouldn't recommend dynamically growing the register file. You are likely to get super-linear compile time, and it is not clear that register allocation would achieve anything compared to simply outputting virtual registers. Surely, ptxas' register allocator can reuse a register for non-overlapping live ranges. That is all you would get out of this.

The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen. This works, but is not scalable. If the register allocator runs out of registers, spill code must be generated. However, the "optimal" solution in this case would be to extend the register file. A few alternatives I have come up with are:
  • Bypass register allocation completely and just emit virtual registers,

This is worth a try. It is possible you want to run LLVM's 2-addr, phi-elim, and coalescer passes first.

  • Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or

Don't do that.

  • Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed

This could also work. Spill slots actually do what you want. The register allocator tries to use as few as possible as long as performance doesn't suffer. Later, StackSlotColoring will merge non-overlapping stack slot ranges to save more space.
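For intuition, the slot merging StackSlotColoring performs can be modeled as greedy interval coloring: two slots may share a "color" (a slot, or in the PTX setting a register) whenever their live ranges never overlap. This is a toy model under that assumption, not the actual LLVM implementation.

```python
# Toy model of stack slot coloring: merge slots whose live ranges never
# overlap. Live ranges are (start, end) half-open instruction-index
# intervals; the return value maps each slot to a color.
def color_slots(live_ranges):
    order = sorted(range(len(live_ranges)), key=lambda i: live_ranges[i][0])
    colors = [None] * len(live_ranges)
    end_of = []  # end point of the last range assigned to each color
    for i in order:
        start, end = live_ranges[i]
        for c, last_end in enumerate(end_of):
            if last_end <= start:        # no overlap: reuse this color
                colors[i] = c
                end_of[c] = end
                break
        else:                            # every color is live here: open a new one
            colors[i] = len(end_of)
            end_of.append(end)
    return colors

# Three slots, but the first and third never overlap, so two colors suffice.
print(color_slots([(0, 4), (2, 6), (5, 8)]))  # → [0, 1, 0]
```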

I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code. Option (2) seems promising, though I worry about the feasibility of the approach. Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture.

Any thoughts from the experts out there? Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered.

Sorry to be backwards, but I think you should try (1) or (3).

Simply outputting virtual registers seems like a reasonable thing to do if PTX is really an intermediate form. LLVM's instruction selector and phi-elim tend to emit a lot of copies, so you probably want to run the coalescer before emission. That will minimize the number of copies. This is also the fastest thing you can do.

There are two reasons you may want to run the register allocator anyway:

- Coalescing is very aggressive. It creates long, interfering live ranges. If ptxas doesn't have live range splitting, you may benefit from LLVM's.

- Passes like LICM and CSE will increase register pressure by hoisting redundant computations. If ptxas cannot rematerialize these computations in high register pressure situations, LLVM's register allocator can help you.

Note that if you always make sure there are 'enough' physical registers, the register allocator will never split live ranges or rematerialize computations. That's why (2) doesn't buy you anything over (1).

Use LLVM's register allocator like this:

- Provide a realistic number of physical registers. Make it similar to the target architecture, but aim low.

- Map spill slots to PTX registers. That means 'spilling' is really a noop, except you get live range splitting and remat. If you implement TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be no inserted loads and stores.

The result should be code that is easy to register allocate for ptxas with some live ranges that obviously should go in registers, and some that obviously should spill. There will be a number of live ranges that can go either way, depending on the actual number of registers targeted.

/jakob

Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes. I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does not have a concrete, pre-defined register file. The architecture is PTX, which is more of an intermediate form than a final assembly language. The format is essentially three-address code, with “virtual” registers instead of “physical” registers. After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints). However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new “physical” register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible.

What kind of optimizations can ptxas do? Can it hoist computations out of loops? Can it split live ranges? Can it coalesce live ranges?

Part of my problem is that ptxas is proprietary software, so it’s essentially a black box to me. It appears to do a reasonable job, but I’ve also seen cases where PTX-level register re-use led to better device register utilization.

Generally stated, I would like to solve the register allocation problem as “allocate the minimum number of registers from an arbitrary set without spill code” instead of the more traditional “allocate the minimum number of registers from a fixed set.”

It’s a common misconception, but that is not what LLVM’s register allocators do. They try to minimize the amount of executed spill code given the fixed set of registers.

I wouldn’t recommend dynamically growing the register file. You are likely to get super-linear compile time, and it is not clear that register allocation would achieve anything compared to simply outputting virtual registers. Surely, ptxas’ register allocator can reuse a register for non-overlapping live ranges. That is all you would get out of this.

That makes sense. If the LLVM register allocators do not actively try to minimize register usage, then I see how there would not be a win here.

The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen. This works, but is not scalable. If the register allocator runs out of registers, spill code must be generated. However, the “optimal” solution in this case would be to extend the register file. A few alternatives I have come up with are:
• Bypass register allocation completely and just emit virtual registers,

This is worth a try. It is possible you want to run LLVM’s 2-addr, phi-elim, and coalescer passes first.

I definitely need to look into those passes some more. I just hesitate to ignore the LLVM register allocator since I have seen it generate better final code (post-ptxas).

• Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or

Don’t do that.

I see now why that would be sub-optimal.

• Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed

This could also work. Spill slots actually do what you want. The register allocator tries to use as few as possible as long as performance doesn’t suffer. Later, StackSlotColoring will merge non-overlapping stack slot ranges to save more space.

That’s good to know. I was hoping LLVM did something like that, but I have not really scanned that code too thoroughly yet.

I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code. Option (2) seems promising, though I worry about the feasibility of the approach. Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture.

Any thoughts from the experts out there? Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered.

Sorry to be backwards, but I think you should try (1) or (3).

Simply outputting virtual registers seems like a reasonable thing to do if PTX is really an intermediate form. LLVM’s instruction selector and phi-elim tend to emit a lot of copies, so you probably want to run the coalescer before emission. That will minimize the number of copies. This is also the fastest thing you can do.

There are two reasons you may want to run the register allocator anyway:

  • Coalescing is very aggressive. It creates long, interfering live ranges. If ptxas doesn’t have live range splitting, you may benefit from LLVM’s.

  • Passes like LICM and CSE will increase register pressure by hoisting redundant computations. If ptxas cannot rematerialize these computations in high register pressure situations, LLVM’s register allocator can help you.

Note that if you always make sure there are ‘enough’ physical registers, the register allocator will never split live ranges or rematerialize computations. That’s why (2) doesn’t buy you anything over (1).

Interesting. I was working under the assumption that the register allocators tried to minimize register use.

Use LLVM’s register allocator like this:

  • Provide a realistic number of physical registers. Make it similar to the target architecture, but aim low.

Sounds reasonable.

  • Map spill slots to PTX registers. That means ‘spilling’ is really a noop, except you get live range splitting and remat. If you implement TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be no inserted loads and stores.

That’s good to know.

The result should be code that is easy to register allocate for ptxas with some live ranges that obviously should go in registers, and some that obviously should spill. There will be a number of live ranges that can go either way, depending on the actual number of registers targeted.

This was definitely very informative! Thanks for the information!

Justin,

We have the same issue with the AMDIL code generator. We tried #1, but there are passes after the register allocator that don’t like virtual registers. #3 could be done by having the two spill functions [load|store]Reg[From|To]StackSlot keep track of the FrameIndex-to-register mapping internally, but again, that is more of a hack than a proper solution.

After reading Jakob’s comments, I think (3) may end up being the best in the long term. I’ll definitely post any results to the list!

My solution was to just create a very large register file, 768 registers, that no sane kernel would ever reach, and then do register allocation within that. All that is needed is a simple script, run at build time, that generates the tables into a separate .td file, which is then included in the necessary locations so it doesn’t bloat the code.

That is essentially what happens now; the only difference is that the register description file is generated at development time instead of at build time. I just feel there should be a more “scalable” approach.

I have faced this same problem in my backend, and I’m working around it by providing a large physical register set. There are two problems with this:

  1. There’s a chance that the register allocator will run out of registers to assign, in which case the allocation will fail, making it necessary to retry with a larger register set
  2. The code generator consumes storage proportional to the number of registers that could be assigned

I’d be interested in an improvement to the code generator that makes it possible to specify an infinite register set without the need to store the registers explicitly.

Andrew

As I explained to Justin, allocating against an infinite register file doesn't really do anything. It's a quadratic time no-op.

What you can do instead is:

1) Just use virtual registers and skip register allocation, or

2) Allocate to a small register file, implement memory operand folding, and pretend that spill slots are registers.

/jakob

Empirically, 1) is not true. The linear scan register allocator appears to do a very good job of reusing registers that have been killed. It also automatically inserts copies for operations that clobber register operands, and coalesces identity register moves. 2) is a good idea, but when I implemented register allocation, I did not find a straightforward way to integrate spilled memory operands into my instruction set.
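The reuse Andrew describes falls out of the classic linear-scan scheme: when an interval expires, its register returns to the free pool and later intervals can pick it up. A toy sketch of that mechanism (not LLVM's actual implementation, which also handles live-range holes, fixed registers, and spill weights):

```python
import heapq

# Toy linear-scan allocator. intervals: (start, end) pairs sorted by
# start, one per virtual register. Returns a physical register index
# per interval, or "spill" if the free pool is exhausted.
def linear_scan(intervals, num_regs):
    free = list(range(num_regs))
    active = []  # min-heap of (end, preg) for currently live intervals
    assignment = []
    for start, end in intervals:
        # Expire intervals that ended by `start`; their registers are reusable.
        while active and active[0][0] <= start:
            _, preg = heapq.heappop(active)
            free.append(preg)
        if free:
            preg = free.pop(0)
            heapq.heappush(active, (end, preg))
            assignment.append(preg)
        else:
            assignment.append("spill")
    return assignment

# Four overlapping-in-pairs intervals fit in two registers via reuse.
print(linear_scan([(0, 3), (1, 4), (3, 6), (4, 7)], 2))  # → [0, 1, 0, 1]
```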

Andrew

Good point. The 2-addr, phi-elim, and coalescer passes are definitely helpful.

The final register allocator pass that assigns physical registers probably doesn't help you much.

/jakob

What you can do instead is:

  1. Just use virtual registers and skip register allocation, or

  2. Allocate to a small register file, implement memory operand folding, and pretend that spill slots are registers.

/jakob

Empirically, 1) is not true. The linear scan register allocator appears to do a very good job of reusing registers that have been killed. It also automatically inserts copies for operations that clobber register operands, and coalesces identity register moves.

Good point. The 2-addr, phi-elim, and coalescer passes are definitely helpful.

The final register allocator pass that assigns physical registers probably doesn’t help you much.

I plan on eventually implementing both and seeing which works best for different types of input.

If virtual registers are used, how do you disable final register allocation in the back-end? Looking through the different Target* classes, I do not see any way to disable it. I imagine the TargetRegisterClass implementations are still needed to determine legal virtual register types, but are physical register definitions still needed? This would seem to defeat the purpose of using virtual registers in the first place. Unfortunately, there does not seem to be any documentation (or even an existing back-end) using this approach.

For the stack slot approach, what exactly are the semantics of the foldMemoryOperandImpl method? And how does it relate to the storeRegToStackSlot and loadRegFromStackSlot methods? Do the storeReg/loadReg methods generate the (store/load) spill code, and does the foldMemoryOperandImpl method fold the generated loads/stores directly into the instructions that reference them?

I plan on eventually implementing both and seeing which works best for different types of input.

If virtual registers are used, how do you disable final register allocation in the back-end?

If post-RA passes have trouble with virtual registers, you probably need to implement your own addCommonCodeGenPasses() method.

Alternatively, implement a trivial register allocator that simply runs 2-addr, phi-elim, and coalescing.

Looking through the different Target* classes, I do not see any way to disable it. I imagine the TargetRegisterClass implementations are still needed to determine legal virtual register types, but are physical register definitions still needed? This would seem to defeat the purpose of using virtual registers in the first place. Unfortunately, there does not seem to be any documentation (or even an existing back-end) using this approach.

That's right. What you are doing is very different from what a 'real' target requires, so you should probably try to figure out which passes make sense for a GPU back-end.

For the stack slot approach, what exactly are the semantics of the foldMemoryOperandImpl method? And how does it relate to the storeRegToStackSlot and loadRegFromStackSlot methods? Do the storeReg/loadReg methods generate the (store/load) spill code, and does the foldMemoryOperandImpl method fold the generated loads/stores directly into the instructions that reference them?

When a register is spilled, the register allocator first tries foldMemoryOperand on all instructions using the register. If successful, the target creates an instruction that accesses the stack slot directly (as is possible on x86 and other CISC architectures). If it fails, the register allocator creates a new tiny live range around the existing instruction, and uses storeRegToStackSlot and loadRegFromStackSlot to spill and reload that new register around the instruction.
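The control flow above could be modeled roughly as follows. The hook names in the comments mirror the TargetInstrInfo API being discussed, but the dict-based instruction representation is invented purely for illustration:

```python
# Toy model of the spill flow: for each use of the spilled register,
# either fold the stack slot into the instruction, or wrap the
# instruction in a reload/spill pair around a short tiny live range.
# The instruction format (dicts with "op"/"operands"/"defines") is
# invented for this sketch.
def spill_register(reg, slot, instructions, try_fold):
    out = []
    for instr in instructions:
        if reg not in instr["operands"]:
            out.append(instr)
            continue
        folded = try_fold(instr, reg, slot)   # models TII::foldMemoryOperandImpl
        if folded is not None:
            out.append(folded)                # instruction accesses the slot directly
        else:
            # Tiny live range: reload before the use, spill after the def.
            out.append({"op": "load", "operands": [reg, slot]})       # loadRegFromStackSlot
            out.append(instr)
            if instr.get("defines") == reg:
                out.append({"op": "store", "operands": [reg, slot]})  # storeRegToStackSlot
    return out

instrs = [{"op": "add", "operands": ["r1", "r2"], "defines": "r1"}]
# With a fold callback that always fails, the use gets a reload/spill pair.
print([i["op"] for i in spill_register("r1", "slot0", instrs,
                                       lambda i, r, s: None)])  # → ['load', 'add', 'store']
```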

/jakob

From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu]
On Behalf Of Jakob Stoklund Olesen
Sent: Tuesday, May 17, 2011 2:25 PM
To: Justin Holewinski
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] TargetRegisterInfo and "infinite" register files

> I plan on eventually implementing both and seeing which works best for different types of input.
>
> If virtual registers are used, how do you disable final register allocation in the back-end?

If post-RA passes have trouble with virtual registers, you probably need to implement your own addCommonCodeGenPasses() method.

[Villmow, Micah] We did this in our backend from LLVM 2.4 to LLVM 2.8. It caused more problems than I can remember, because LLVM changes quite often, and many times the functionality we were relying on was either removed or modified in a way that didn't work for us, making integration a pain. I'd advise against this approach unless you are willing to keep track of and maintain all of the changes to LLVMTargetMachine.cpp in your own version.

Alternatively, implement a trivial register allocator that simply runs 2-addr, phi-elim, and coalescing.

> Looking through the different Target* classes, I do not see any way to disable it. I imagine the TargetRegisterClass implementations are still needed to determine legal virtual register types, but are physical register definitions still needed? This would seem to defeat the purpose of using virtual registers in the first place. Unfortunately, there does not seem to be any documentation (or even an existing back-end) using this approach.

That's right. What you are doing is very different from what a 'real' target requires, so you should probably try to figure out which passes make sense for a GPU back-end.

[Villmow, Micah] We also tried this approach with the AMD backend, by creating our own register allocator that just ran a few passes but didn't actually allocate physical registers for the virtual ones. There are passes in the backend that don't like virtual registers, e.g. from MachineLICM.cpp:

    assert(TargetRegisterInfo::isPhysicalRegister(Reg) &&
           "Not expecting virtual register!");

Hi, Justin

  Have you read Helge Rhodin's thesis "A PTX Code Generator for LLVM"? He took quite a different approach from yours; perhaps it can give you some insight.

  Here is his project website.
  http://sourceforge.net/projects/llvmptxbackend/

Regards,
chenwj