Fast register allocation

As you may have noticed, I have been working on a fast register allocator over the last week. The new allocator is intended to replace the local allocator for debug builds.

Both allocators use a local allocation strategy. Each basic block is scanned from top to bottom, and virtual registers are assigned to physical registers as they appear. There are no live registers between blocks. Everything is spilled at the end of each block. This strategy works well for unoptimized code where live ranges are short.
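As a toy illustration of this strategy (hypothetical names and data structures, not the actual LLVM code), a purely local allocator can be sketched as:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy model of a local allocator. An instruction uses some virtual
// registers and may define one (def < 0 means no def).
struct Instr {
  std::vector<int> uses;
  int def;
};

// Scan one basic block top to bottom, handing out physical registers as
// virtual registers appear. Returns the number of values spilled at the
// end of the block; nothing stays live across block boundaries.
int allocateBlock(const std::vector<Instr> &block, int numPhysRegs) {
  std::map<int, int> virtToPhys;  // assignments live within this block
  std::set<int> freePhys;
  for (int p = 0; p < numPhysRegs; ++p)
    freePhys.insert(p);

  auto assign = [&](int vreg) {
    if (virtToPhys.count(vreg))
      return;
    assert(!freePhys.empty() && "toy model: no mid-block spilling");
    int p = *freePhys.begin();
    freePhys.erase(freePhys.begin());
    virtToPhys[vreg] = p;
  };

  for (const Instr &I : block) {
    for (int v : I.uses)
      assign(v);  // a real allocator would reload spilled values here
    if (I.def >= 0)
      assign(I.def);
  }

  // End of block: everything is spilled, so the next block starts clean.
  return (int)virtToPhys.size();
}
```

With short live ranges, as in -O0 output, most values die inside the block anyway, so the end-of-block spill flush costs little.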

The fast allocator uses a few tricks to run faster than the local allocator while producing slightly better code.

When RALocal allocates a physical register, it first checks if any aliases are in use to avoid clobbering a subregister. RAFast uses a working set strategy where recently killed registers are kept around for quick allocation without the alias scan.

RALocal uses a least-recently-used algorithm for allocating registers. RAFast uses a whatever-I-can-get-my-hands-on-quickly algorithm.
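The difference between the two approaches can be sketched like this (invented data structures; the real RALocal/RAFast internals differ):

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Illustrative register file state. 'aliases[p]' lists the physical
// registers overlapping p (e.g. EAX overlaps AX, AL, AH on x86).
struct RegFile {
  std::vector<bool> inUse;
  std::vector<std::vector<int>> aliases;
  std::deque<int> recentlyKilled;  // RAFast-style working set

  // RALocal-style check: a register is only usable if it and all of its
  // aliases are free, which costs a scan per allocation.
  bool usableWithAliasScan(int p) const {
    if (inUse[p])
      return false;
    for (int a : aliases[p])
      if (inUse[a])
        return false;
    return true;
  }

  // RAFast-style grab: a recently killed register is known to be free of
  // live aliases, so it can be handed out with no alias scan and no LRU
  // bookkeeping -- whatever is at hand first wins.
  int takeRecentlyKilled() {
    if (recentlyKilled.empty())
      return -1;  // fall back to the slow path
    int p = recentlyKilled.front();
    recentlyKilled.pop_front();
    return p;
  }
};
```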

RALocal folds spills and restores into other instructions. RAFast doesn't bother because debug builds have very few spills.

RAFast uses more aggressive hinting by peeking at future instructions. This helps reduce the number of register copies when setting up parameters for a call:

  %reg1028<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
  %reg1029<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
  ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
  %reg1030<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
  %reg1031<def> = MOV8r0 %EFLAGS<imp-def,dead>
  %RDI<def> = MOV64rr %reg1030
  %ESI<def> = MOV32rr %reg1028
  %EDX<def> = MOV32rr %reg1029
  %AL<def> = MOV8rr %reg1031
  CALL64pcrel32 <ga:@printf>, %RDI, %ESI, %EDX, %AL, %RAX<imp-def>, %RDX<imp-def>, %RSI<imp-def>, %RDI<imp-def>, %RSP<imp-use>, ...

When finding a register for %reg1028, RAFast sees that it is later copied to %ESI. It is allocated %ESI and the copy disappears:

  %ESI<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
  %EDX<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
  ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
  %RDI<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
  %AL<def> = MOV8r0 %EFLAGS<imp-def,dead>
  CALL64pcrel32 <ga:@printf>, %RDI<kill>, %ESI<kill>, %EDX<kill>, %AL<kill>, %RAX<imp-def>, %RDX<imp-def>, %RSI<imp-def>, %RDI<imp-def>, %RSP<imp-use>, ...
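The lookahead can be sketched roughly like this (a simplified model with invented names, not the actual RAFast implementation):

```cpp
#include <cassert>
#include <vector>

// A later instruction of the form "dstPhys = COPY srcVirt".
struct Copy {
  int srcVirt;
  int dstPhys;
};

// When allocating 'vreg', peek at upcoming copies: if the value is about
// to be copied into a physical register that is still free, allocate
// that register directly so the copy becomes a no-op and can be erased.
int pickPhysReg(int vreg, const std::vector<Copy> &upcomingCopies,
                const std::vector<bool> &physFree, int fallback) {
  for (const Copy &c : upcomingCopies)
    if (c.srcVirt == vreg && physFree[c.dstPhys])
      return c.dstPhys;  // hint honored: the copy will disappear
  return fallback;       // no usable hint; take any free register
}
```

In the dump above, %reg1028 is about to be copied into %ESI, so the hint makes the allocator hand out %ESI directly and the MOV32rr vanishes.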

The aggressive coalescing also reduces register pressure, and there is a lot less spilling.
In a debug build of SPECCPU2006/403.gcc it results in 8% fewer instructions:

   Local     Fast
  -------  -------
  153623   258567   regalloc    - Number of copies coalesced
   24975    18381   regalloc    - Number of loads added
   19361    13864   regalloc    - Number of stores added
 1109880  1018144   asm-printer - Number of machine instrs printed

The fast allocator is about three times faster than the old local allocator.
It reduces the overall code generation time from 7.0s to 5.9s for 403.gcc:

RALocal: Total Execution Time: 6.9976 seconds (7.1717 wall clock)

  ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
  2.7376 ( 42.1%)   0.2554 ( 51.8%)   2.9931 ( 42.8%)   2.9930 ( 41.7%)   X86 DAG->DAG Instruction Selection
  1.4644 ( 22.5%)   0.0073 (  1.5%)   1.4717 ( 21.0%)   1.4714 ( 20.5%)   Local Register Allocator
  0.5407 (  8.3%)   0.1298 ( 26.3%)   0.6705 (  9.6%)   0.8463 ( 11.8%)   X86 AT&T-Style Assembly Printer

RAFast: Total Execution Time: 5.9255 seconds (6.0734 wall clock)

  ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
  2.7223 ( 50.1%)   0.2574 ( 52.5%)   2.9796 ( 50.3%)   2.9796 ( 49.1%)   X86 DAG->DAG Instruction Selection
  0.5241 (  9.6%)   0.1237 ( 25.2%)   0.6479 ( 10.9%)   0.8003 ( 13.2%)   X86 AT&T-Style Assembly Printer
  0.4526 (  8.3%)   0.0063 (  1.3%)   0.4589 (  7.7%)   0.4584 (  7.5%)   Fast Register Allocator

The fast allocator can be enabled for clang -O0 with a small patch:

diff --git a/lib/Frontend/CodeGenAction.cpp b/lib/Frontend/CodeGenAction.cpp
index 86005f2..cd2d2f0 100644
--- a/lib/Frontend/CodeGenAction.cpp
+++ b/lib/Frontend/CodeGenAction.cpp
@@ -322,7 +322,7 @@ bool BackendConsumer::AddEmitPasses() {

   // Set register scheduler & allocation policy.
   RegisterScheduler::setDefault(createDefaultScheduler);
- RegisterRegAlloc::setDefault(Fast ? createLocalRegisterAllocator :
+ RegisterRegAlloc::setDefault(Fast ? createFastRegisterAllocator :
                                createLinearScanRegisterAllocator);

   // Create the code generator passes.

I have tested clang self-hosting on x86 and x86-64 with this patch.
On the nightly test suite, -regalloc=fast performs as well as -regalloc=local. See PR7154.

I expect we may see some problems with the register scavenger on ARM because -regalloc=fast doesn't currently add <imp-def,dead> flags on call-clobbered registers. That makes register pressure look quite a bit higher to the scavenger than it really is.

Share and Enjoy!
/jakob

This is great work, Jakob! What is required, and when do you think you'll be killing off the local allocator?

-Chris

I think we should switch clang -O0 to the fast allocator right away. It is the only way of smoking out the remaining bugs.

I am not sure if or when llvm-gcc wants to switch, but when there are no more clients in the tree, we can kill off the local allocator.

/jakob

You mention some potential issues on ARM, should we sort those out
before we enable it in Clang? It's somewhat more convenient to have
things be consistent.

I propose that at some point you just replace the old local register
allocator with the new one, and rename the old one. Then clients don't
need to change, and we can keep the old one available for a little
while if we want, for testing.

- Daniel

> You mention some potential issues on ARM, should we sort those out
> before we enable it in Clang? It's somewhat more convenient to have
> things be consistent.

Sure, I'll be testing some ARM code today. There is also Evan's new REG_SEQUENCE instruction to verify.

> I propose that at some point you just replace the old local register
> allocator with the new one, and rename the old one. Then clients don't
> need to change, and we can keep the old one available for a little
> while if we want, for testing.

I am not a big fan of renaming - it can easily cause confusion. Besides, fast-alloc goes with fast-isel.

But perhaps we could take this opportunity to let LLVMTargetMachine choose a register allocator based on the optimization level? It is already enabling fast-isel that way, and -O0 -regalloc=linearscan (which is the default) doesn't really make sense.

LLVMTargetMachine::addCommonCodeGenPasses does a decent job of building a codegen pipeline, except it needs help picking a register allocator. There is no good reason for that.
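The idea could look something like this (a hypothetical helper, not actual LLVM code; the real change would live in addCommonCodeGenPasses):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: derive the register allocator from the
// optimization level, unless the user explicitly passed -regalloc=...
enum class OptLevel { None, Default, Aggressive };

std::string chooseRegAlloc(OptLevel opt, const std::string &userOverride) {
  if (!userOverride.empty())
    return userOverride;  // an explicit -regalloc option always wins
  // Mirror the fast-isel policy: -O0 gets the fast allocator by default.
  return opt == OptLevel::None ? "fast" : "linearscan";
}
```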

/jakob

Ok, but we should switch both clang and llvm-gcc at the same time.

-Chris

>> You mention some potential issues on ARM, should we sort those out
>> before we enable it in Clang? It's somewhat more convenient to have
>> things be consistent.
>
> Sure, I'll be testing some ARM code today. There is also Evan's new REG_SEQUENCE instruction to verify.
>
>> I propose that at some point you just replace the old local register
>> allocator with the new one, and rename the old one. Then clients don't
>> need to change, and we can keep the old one available for a little
>> while if we want, for testing.
>
> I am not a big fan of renaming - it can easily cause confusion. Besides, fast-alloc goes with fast-isel.
>
> But perhaps we could take this opportunity to let LLVMTargetMachine choose a register allocator based on the optimization level? It is already enabling fast-isel that way, and -O0 -regalloc=linearscan (which is the default) doesn't really make sense.
>
> LLVMTargetMachine::addCommonCodeGenPasses does a decent job of building a codegen pipeline, except it needs help picking a register allocator. There is no good reason for that.

Agreed.

Evan