Greedy register allocation

Perhaps you noticed that LLVM gained a new optimizing register allocator yesterday (r130568). Linear scan is going away, and RAGreedy is the new default for optimizing builds.

Hopefully, you noticed because your binaries were suddenly 2% smaller and 10% faster*. Some noticed because LLVM started crashing or miscompiling their code. Greedy replaces a fairly big chunk of the code generator, so there will be some bugs. Please file bug reports!

Linear scan will stick around for a limited time. It can be enabled with ‘clang -mllvm -regalloc=linearscan’ and ‘llc -regalloc=linearscan’. If you think you found a bug in the new allocator, please verify that the problem goes away when switching back to linear scan.

I would also like to hear about instances where greedy produces obviously silly code, even if it is technically correct.

Share and Enjoy!

*) Individual results may vary.

Greedy Register Allocator

The new register allocator is designed to be more flexible than linear scan. It still uses the LiveInterval data structure for interference checking, but it can allocate live ranges in any order.

The most important new features are:

  • Allow code editing on the fly. Currently, spill code and copies are inserted immediately, but other code changes are also possible.
  • Flexible order of allocation and eviction. The ordering can be used to fine tune the register allocator’s behavior. Register assignments can be undone without backtracking.
  • Live range splitting. Global and local live ranges are split by two different algorithms, both guided by interference from already allocated registers.
  • Prefer cheap registers for busy live ranges. The x86-64 and ARM Thumb2 targets have registers that are more expensive to encode in instructions. Those registers are used for the less busy live ranges, which improves code size.
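The flexible allocation order and eviction can be illustrated with a toy model. This is a sketch only, with invented names and a deliberately simplified priority function (longest range first) and interference check; it is not LLVM's actual data structures or heuristics:

```python
import heapq

def greedy_allocate(ranges, num_regs):
    """ranges: list of (name, start, end, spill_weight) tuples.
    Returns {name: register index or "spill"}."""
    # Longer ranges are allocated first (a stand-in for the real
    # allocator's priority order); ties are broken by name for determinism.
    queue = [(-(end - start), name, start, end, weight)
             for name, start, end, weight in ranges]
    heapq.heapify(queue)
    assigned = {}   # name -> (reg, start, end, weight)
    result = {}

    def overlaps(s1, e1, s2, e2):
        return s1 < e2 and s2 < e1

    def clashes(reg, s, e):
        return [n for n, (r, cs, ce, cw) in assigned.items()
                if r == reg and overlaps(s, e, cs, ce)]

    while queue:
        _, name, s, e, w = heapq.heappop(queue)
        free = next((r for r in range(num_regs) if not clashes(r, s, e)), None)
        if free is not None:
            assigned[name] = (free, s, e, w)
            result[name] = free
            continue
        # Eviction: undo an earlier assignment if every interfering range
        # is strictly lighter, and requeue the evicted ranges.
        for reg in range(num_regs):
            victims = clashes(reg, s, e)
            if all(assigned[v][3] < w for v in victims):
                for v in victims:
                    _, vs, ve, vw = assigned.pop(v)
                    del result[v]
                    heapq.heappush(queue, (-(ve - vs), v, vs, ve, vw))
                assigned[name] = (reg, s, e, w)
                result[name] = reg
                break
        else:
            result[name] = "spill"
    return result
```

With one register, a long cheap range gets allocated first, then a short hot range evicts it; the evicted range is retried and ends up spilled instead of the hot one.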

Code Size

The new allocator can improve code size because live range splitting causes less spill code to be inserted, and on x86-64 and thumb2, expensive registers are used less. It can increase code size by inserting more register copies in order to avoid a spill.

On the nightly test suite, we see these changes in total code size:

i386: -1.2%

x86-64: -1.6%

armv7: -2.3%

Using clang to build itself on x86-64, we get these __text segments for Release+Debug/bin/clang:

LinScan: 15750170

Greedy: 15486090

Difference: -1.7%

More complete code size data is below; there is a lot of variation.

Performance

Live range splitting improves the performance of compiled code by eliminating spills, or by moving spills out of loops. Live ranges that cross a function call and are also used in a hot loop get split so there are no spills in the loop. Long, complicated basic blocks benefit from local live range splitting.
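A toy cost model shows why splitting a call-crossing range pays off (the numbers and the cost function below are invented for illustration, not the allocator's actual spill weights): spilling the whole range puts a reload at every use inside the hot loop, while splitting at the call confines the spill to the cold segment.

```python
def reload_cost(reload_freqs):
    # Each reload costs its block frequency.
    return sum(reload_freqs)

loop_freq = 100                 # the loop body runs ~100x per function entry

# No splitting: the call clobbers the register, the whole range spills,
# and each of its 3 loop uses reloads from the stack.
whole_range = reload_cost([loop_freq] * 3)

# Splitting: a separate range covers the call and is reloaded once on the
# cold side (frequency 1); the loop-local range keeps its register.
split_range = reload_cost([1])
```

Here `whole_range` is 300 and `split_range` is 1, which is the effect described above: the spill code moves out of the loop.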

The register-starved i386 target benefits the most. This is the change in execution time for the SPEC benchmarks that change by more than 3% (minus means faster, plus means slower):

Targeting i386:

-19.3% 164.gzip

-12.5% 433.milc

-8.8% 473.astar

-7.4% 401.bzip2

-6.4% 183.equake

-4.9% 456.hmmer

-4.6% 186.crafty

-4.6% 188.ammp

-4.1% 403.gcc

-4.0% 256.bzip2

-3.2% 197.parser

-3.1% 175.vpr

-3.0% 464.h264ref

+6.7% 177.mesa

With more registers and out-of-order execution hiding the cost of spilling, x86-64 is more mixed. I suspect this architecture is more sensitive to code layout issues than to register allocation:

Targeting x86-64:

-6.4% 464.h264ref

-6.1% 256.bzip2

-5.2% 183.equake

-4.8% 447.dealII

-3.9% 400.perlbench

-3.5% 401.bzip2

-3.3% 255.vortex

+3.8% 186.crafty

+5.0% 462.libquantum

+8.0% 471.omnetpp

Finally, armv7/thumb2 running on a Cortex-A9 CPU does quite well:

Targeting armv7:

-6.2% 447.dealII

-4.4% 183.equake

-4.1% 462.libquantum

-3.5% 401.bzip2

Clang built with the greedy register allocator builds llvm+clang about 0.5% faster.

More data below.

Compile Time

Linear scan spends about 50% of its compile time in VirtRegRewriter, unfolding stack accesses that it folded when spilling. The new allocator uses live range splitting instead, so it can use a much faster trivial rewriter that never unfolds memory accesses.

This means that the new allocator is faster than linear scan when compiling small functions (such as Objective-C code), but global live range splitting becomes expensive when compiling large functions. When compiling 403.gcc, the new allocator uses 15-40% more time than linear scan, depending on the target. These times are for a PIC build of 403.gcc, timing just the register allocator pass:

         LinScan  Greedy  LLC Total
armv7     1.77s   2.47s      +3.2%
x86-64    1.91s   2.38s      +1.9%
i386      2.42s   2.82s      +0.9%

These numbers translate to a 1-3% increase in llc total code generation time. Linear scan is faster on ARM because that target doesn’t fold memory operands. That means there is nothing to unfold for the expensive VirtRegRewriter. ARM’s use of base registers in functions with stack frames also means more work for global live range splitting.

A more realistic example is timing clang building llvm+clang for x86-64:

LinScan: 1991.20s

Greedy: 1991.93s

Difference: none

The greedy register allocator speeds up clang a bit on x86-64, so the new allocator is actually faster than linear scan when self-hosting:

Greedy hosting greedy: 1981.56s

Difference: -0.5% (greedy faster)

Timing llc on Sketch.bc created from all the Objective-C source files in the Xcode Sketch sample project, we get:

LinScan: 0.733s

Greedy: 0.724s

Difference: -1.23% (greedy faster)

Code Size Data

These are all the nightly test suite tests that produce a __text segment larger than 8K and a code size difference larger than 1%. Negative numbers mean smaller code, positive larger.

Targeting i386 PIC -O2:

-7.3% MultiSource/Benchmarks/MiBench/security-rijndael/Output/security-rijndael
-7.2% MultiSource/Benchmarks/Prolangs-C/simulator/Output/simulator
-5.3% MultiSource/Applications/hexxagon/Output/hexxagon
-5.2% MultiSource/Applications/SIBsim4/Output/SIBsim4
-4.8% MultiSource/Benchmarks/MallocBench/cfrac/Output/cfrac
-4.1% MultiSource/Applications/d/Output/make_dparser
-4.0% MultiSource/Applications/JM/ldecod/Output/ldecod
-3.9% External/Nurbs/Output/nurbs
-3.7% MultiSource/Benchmarks/Ptrdist/bc/Output/bc
-3.4% MultiSource/Applications/lemon/Output/lemon
-3.4% SingleSource/Benchmarks/Misc-C++/Output/bigfib
-3.4% External/SPEC/CFP2000/183.equake/Output/183.equake
-3.2% MultiSource/Applications/hbd/Output/hbd
-3.1% MultiSource/Applications/minisat/Output/minisat
-3.0% MultiSource/Benchmarks/tramp3d-v4/Output/tramp3d-v4
-2.9% External/skidmarks10/Output/skidmarks
-2.9% MultiSource/Benchmarks/ASCI_Purple/SMG2000/Output/smg2000
-2.8% External/SPEC/CINT95/132.ijpeg/Output/132.ijpeg
-2.8% MultiSource/Benchmarks/MiBench/telecomm-gsm/Output/telecomm-gsm
-2.8% MultiSource/Benchmarks/mediabench/gsm/toast/Output/toast
-2.7% External/SPEC/CINT95/134.perl/Output/134.perl
-2.6% MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/Output/cjpeg
-2.6% MultiSource/Applications/oggenc/Output/oggenc
-2.5% MultiSource/Benchmarks/VersaBench/dbms/Output/dbms
-2.5% External/SPEC/CINT2006/473.astar/Output/473.astar
-2.5% SingleSource/Benchmarks/Adobe-C++/Output/simple_types_loop_invariant
-2.5% MultiSource/Benchmarks/Prolangs-C/loader/Output/loader
-2.4% MultiSource/Benchmarks/sim/Output/sim
-2.4% External/SPEC/CFP2006/433.milc/Output/433.milc
-2.4% External/SPEC/CINT95/124.m88ksim/Output/124.m88ksim
-2.4% MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/Output/timberwolfmc
-2.4% MultiSource/Applications/treecc/Output/treecc
-2.3% MultiSource/Benchmarks/MiBench/consumer-jpeg/Output/consumer-jpeg
-2.2% External/SPEC/CFP2006/450.soplex/Output/450.soplex
-2.2% External/SPEC/CINT2000/186.crafty/Output/186.crafty
-2.2% SingleSource/Benchmarks/Misc-C++/Output/stepanov_container
-2.1% External/SPEC/CINT2006/456.hmmer/Output/456.hmmer
-2.1% External/SPEC/CINT2000/254.gap/Output/254.gap
-2.0% MultiSource/Applications/JM/lencod/Output/lencod
-2.0% MultiSource/Benchmarks/FreeBench/pifft/Output/pifft
-1.9% External/SPEC/CFP2006/447.dealII/Output/447.dealII
-1.9% External/SPEC/CINT2006/401.bzip2/Output/401.bzip2
-1.9% External/SPEC/CINT2006/464.h264ref/Output/464.h264ref
-1.9% MultiSource/Applications/SPASS/Output/SPASS
-1.8% MultiSource/Benchmarks/McCat/18-imp/Output/imp
-1.8% External/SPEC/CFP2006/444.namd/Output/444.namd
-1.8% MultiSource/Benchmarks/Prolangs-C/unix-smail/Output/unix-smail
-1.8% MultiSource/Benchmarks/MallocBench/espresso/Output/espresso
-1.7% MultiSource/Benchmarks/MallocBench/gs/Output/gs
-1.7% MultiSource/Applications/sqlite3/Output/sqlite3
-1.7% External/SPEC/CINT2000/255.vortex/Output/255.vortex
-1.7% External/SPEC/CINT95/147.vortex/Output/147.vortex
-1.6% External/SPEC/CINT2006/483.xalancbmk/Output/483.xalancbmk
-1.5% External/SPEC/CINT2000/256.bzip2/Output/256.bzip2
-1.5% External/SPEC/CINT2000/252.eon/Output/252.eon
-1.5% MultiSource/Benchmarks/FreeBench/fourinarow/Output/fourinarow
-1.5% MultiSource/Benchmarks/mafft/Output/pairlocalalign
-1.5% MultiSource/Applications/lua/Output/lua
-1.4% MultiSource/Applications/siod/Output/siod
-1.4% External/SPEC/CFP2000/177.mesa/Output/177.mesa
-1.3% MultiSource/Benchmarks/Bullet/Output/bullet
-1.3% MultiSource/Applications/spiff/Output/spiff
-1.3% External/SPEC/CINT2000/175.vpr/Output/175.vpr
-1.2% Total
-1.1% External/SPEC/CINT95/130.li/Output/130.li
-1.1% External/SPEC/CINT2006/403.gcc/Output/403.gcc
-1.1% External/SPEC/CINT2006/429.mcf/Output/429.mcf
-1.1% MultiSource/Applications/lambda-0.1.3/Output/lambda
-1.0% MultiSource/Benchmarks/Trimaran/enc-3des/Output/enc-3des
-1.0% MultiSource/Benchmarks/Ptrdist/yacr2/Output/yacr2
+1.1% MultiSource/Benchmarks/MiBench/office-ispell/Output/office-ispell
+1.2% SingleSource/Benchmarks/Misc-C++-EH/Output/spirit
+1.4% SingleSource/Benchmarks/Misc/Output/oourafft
+1.4% MultiSource/Benchmarks/Prolangs-C/assembler/Output/assembler
+2.1% MultiSource/Benchmarks/Prolangs-C/unix-tbl/Output/unix-tbl
+2.3% MultiSource/Benchmarks/Prolangs-C/gnugo/Output/gnugo
+2.4% External/SPEC/CFP2000/179.art/Output/179.art
+2.5% SingleSource/Benchmarks/Adobe-C++/Output/functionobjects
+3.3% SingleSource/Benchmarks/Adobe-C++/Output/simple_types_constant_folding
+3.3% External/SPEC/CINT2006/471.omnetpp/Output/471.omnetpp
+7.1% MultiSource/Benchmarks/Prolangs-C/cdecl/Output/cdecl

Targeting x86-64 PIC -O2:

-5.3% MultiSource/Benchmarks/ASCI_Purple/SMG2000/Output/smg2000

-5.1% MultiSource/Applications/SIBsim4/Output/SIBsim4

-4.7% External/SPEC/CINT2006/401.bzip2/Output/401.bzip2

-4.4% External/skidmarks10/Output/skidmarks

-3.9% MultiSource/Applications/JM/ldecod/Output/ldecod

-3.9% MultiSource/Benchmarks/MallocBench/cfrac/Output/cfrac

-3.6% MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/Output/timberwolfmc

-3.2% MultiSource/Benchmarks/MiBench/security-rijndael/Output/security-rijndael

-3.0% MultiSource/Benchmarks/Prolangs-C/simulator/Output/simulator

-2.9% External/SPEC/CINT2000/252.eon/Output/252.eon

-2.9% MultiSource/Benchmarks/Prolangs-C/cdecl/Output/cdecl

-2.8% MultiSource/Benchmarks/MiBench/telecomm-gsm/Output/telecomm-gsm

-2.8% MultiSource/Benchmarks/mediabench/gsm/toast/Output/toast

-2.8% MultiSource/Benchmarks/Ptrdist/bc/Output/bc

-2.7% SingleSource/Benchmarks/Adobe-C++/Output/functionobjects

-2.6% MultiSource/Applications/JM/lencod/Output/lencod

-2.6% MultiSource/Applications/Burg/Output/burg

-2.6% MultiSource/Benchmarks/tramp3d-v4/Output/tramp3d-v4

-2.5% MultiSource/Applications/d/Output/make_dparser

-2.5% External/SPEC/CINT95/124.m88ksim/Output/124.m88ksim

-2.5% External/SPEC/CINT2006/458.sjeng/Output/458.sjeng

-2.4% MultiSource/Benchmarks/Olden/bh/Output/bh

-2.4% MultiSource/Benchmarks/MallocBench/espresso/Output/espresso

-2.4% External/SPEC/CINT2000/300.twolf/Output/300.twolf

-2.4% MultiSource/Benchmarks/Prolangs-C/unix-smail/Output/unix-smail

-2.3% MultiSource/Applications/kimwitu++/Output/kc

-2.3% MultiSource/Applications/hbd/Output/hbd

-2.3% MultiSource/Applications/spiff/Output/spiff

-2.3% MultiSource/Applications/hexxagon/Output/hexxagon

-2.3% External/SPEC/CINT95/134.perl/Output/134.perl

-2.3% External/SPEC/CINT2000/256.bzip2/Output/256.bzip2

-2.2% MultiSource/Benchmarks/Prolangs-C/loader/Output/loader

-2.2% External/SPEC/CFP2000/183.equake/Output/183.equake

-2.2% External/SPEC/CINT2006/464.h264ref/Output/464.h264ref

-2.2% MultiSource/Applications/ClamAV/Output/clamscan

-2.1% External/SPEC/CFP2000/179.art/Output/179.art

-2.1% External/SPEC/CINT2006/471.omnetpp/Output/471.omnetpp

-2.1% MultiSource/Benchmarks/PAQ8p/Output/paq8p

-2.1% MultiSource/Benchmarks/Ptrdist/yacr2/Output/yacr2

-2.1% External/Povray/Output/povray

-2.0% External/SPEC/CFP2006/470.lbm/Output/470.lbm

-2.0% External/SPEC/CINT2006/429.mcf/Output/429.mcf

-2.0% SingleSource/Benchmarks/Misc-C++/Output/stepanov_container

-2.0% External/SPEC/CINT2006/456.hmmer/Output/456.hmmer

-2.0% External/SPEC/CINT2000/164.gzip/Output/164.gzip

-2.0% MultiSource/Benchmarks/mafft/Output/pairlocalalign

-2.0% MultiSource/Benchmarks/Prolangs-C/assembler/Output/assembler

-2.0% MultiSource/Benchmarks/MiBench/office-ispell/Output/office-ispell

-2.0% External/SPEC/CFP2006/433.milc/Output/433.milc

-1.9% MultiSource/Benchmarks/MiBench/consumer-typeset/Output/consumer-typeset

-1.9% External/SPEC/CFP2000/177.mesa/Output/177.mesa

-1.8% External/SPEC/CFP2006/447.dealII/Output/447.dealII

-1.8% External/SPEC/CFP2000/188.ammp/Output/188.ammp

-1.8% External/SPEC/CINT2000/186.crafty/Output/186.crafty

-1.7% MultiSource/Benchmarks/MiBench/consumer-lame/Output/consumer-lame

-1.7% External/SPEC/CINT95/099.go/Output/099.go

-1.7% MultiSource/Applications/SPASS/Output/SPASS

-1.7% SingleSource/Benchmarks/Adobe-C++/Output/simple_types_constant_folding

-1.7% MultiSource/Benchmarks/FreeBench/fourinarow/Output/fourinarow

-1.6% MultiSource/Applications/oggenc/Output/oggenc

-1.6% Total

-1.6% External/SPEC/CINT2006/403.gcc/Output/403.gcc

-1.6% External/SPEC/CINT95/130.li/Output/130.li

-1.6% MultiSource/Benchmarks/VersaBench/dbms/Output/dbms

-1.6% MultiSource/Applications/minisat/Output/minisat

-1.5% MultiSource/Benchmarks/Prolangs-C/agrep/Output/agrep

-1.5% MultiSource/Benchmarks/Prolangs-C/bison/Output/mybison

-1.4% MultiSource/Benchmarks/Trimaran/enc-3des/Output/enc-3des

-1.4% MultiSource/Applications/sqlite3/Output/sqlite3

-1.3% External/SPEC/CFP2006/450.soplex/Output/450.soplex

-1.3% External/SPEC/CINT2006/400.perlbench/Output/400.perlbench

-1.2% MultiSource/Benchmarks/MiBench/consumer-jpeg/Output/consumer-jpeg

-1.2% MultiSource/Benchmarks/MallocBench/gs/Output/gs

-1.2% MultiSource/Benchmarks/MiBench/security-blowfish/Output/security-blowfish

-1.2% MultiSource/Benchmarks/Prolangs-C/football/Output/football

-1.2% External/SPEC/CINT2000/197.parser/Output/197.parser

-1.2% MultiSource/Applications/lemon/Output/lemon

-1.2% MultiSource/Benchmarks/MiBench/automotive-susan/Output/automotive-susan

-1.2% External/SPEC/CINT2000/254.gap/Output/254.gap

-1.1% External/SPEC/CINT2000/253.perlbmk/Output/253.perlbmk

-1.0% External/SPEC/CINT95/132.ijpeg/Output/132.ijpeg

-1.0% MultiSource/Benchmarks/Bullet/Output/bullet

+1.2% MultiSource/Benchmarks/FreeBench/pifft/Output/pifft

+1.6% MultiSource/Benchmarks/McCat/18-imp/Output/imp

+1.7% SingleSource/Benchmarks/Adobe-C++/Output/loop_unroll

+3.2% SingleSource/Benchmarks/Misc/Output/oourafft

Targeting thumbv7 PIC -O2:

-6.8% MultiSource/Benchmarks/Ptrdist/yacr2/Output/yacr2

-5.8% SingleSource/Benchmarks/Adobe-C++/Output/simple_types_constant_folding

-5.8% MultiSource/Benchmarks/Ptrdist/bc/Output/bc

-5.6% External/SPEC/CINT2000/256.bzip2/Output/256.bzip2

-5.6% MultiSource/Applications/Burg/Output/burg

-5.5% MultiSource/Benchmarks/Prolangs-C/assembler/Output/assembler

-5.4% External/SPEC/CINT2006/401.bzip2/Output/401.bzip2

-5.3% External/SPEC/CINT2000/186.crafty/Output/186.crafty

-5.3% MultiSource/Benchmarks/FreeBench/fourinarow/Output/fourinarow

-5.2% MultiSource/Benchmarks/PAQ8p/Output/paq8p

-5.2% MultiSource/Benchmarks/MiBench/automotive-susan/Output/automotive-susan

-5.2% MultiSource/Benchmarks/ASCI_Purple/SMG2000/Output/smg2000

-5.1% External/SPEC/CINT2006/464.h264ref/Output/464.h264ref

-5.1% MultiSource/Benchmarks/MiBench/consumer-typeset/Output/consumer-typeset

-5.0% External/SPEC/CINT95/099.go/Output/099.go

-4.5% External/SPEC/CINT2000/300.twolf/Output/300.twolf

-4.1% MultiSource/Benchmarks/MiBench/security-rijndael/Output/security-rijndael

-4.1% MultiSource/Benchmarks/Prolangs-C/gnugo/Output/gnugo

-4.0% MultiSource/Benchmarks/sim/Output/sim

-3.9% MultiSource/Benchmarks/MiBench/consumer-lame/Output/consumer-lame

-3.6% MultiSource/Applications/hexxagon/Output/hexxagon

-3.6% MultiSource/Benchmarks/MiBench/telecomm-gsm/Output/telecomm-gsm

-3.6% MultiSource/Benchmarks/mediabench/gsm/toast/Output/toast

-3.5% External/SPEC/CFP2000/188.ammp/Output/188.ammp

-3.5% MultiSource/Applications/spiff/Output/spiff

-3.4% MultiSource/Applications/JM/ldecod/Output/ldecod

-3.3% External/SPEC/CINT2000/164.gzip/Output/164.gzip

-3.2% MultiSource/Benchmarks/Prolangs-C/agrep/Output/agrep

-3.2% MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/Output/timberwolfmc

-3.1% MultiSource/Applications/d/Output/make_dparser

-3.1% MultiSource/Applications/lemon/Output/lemon

-2.8% External/SPEC/CINT95/124.m88ksim/Output/124.m88ksim

-2.8% MultiSource/Applications/JM/lencod/Output/lencod

-2.7% MultiSource/Applications/SIBsim4/Output/SIBsim4

-2.7% External/SPEC/CINT2006/400.perlbench/Output/400.perlbench

-2.6% MultiSource/Applications/kimwitu++/Output/kc

-2.6% MultiSource/Applications/siod/Output/siod

-2.6% External/SPEC/CINT95/130.li/Output/130.li

-2.6% MultiSource/Benchmarks/tramp3d-v4/Output/tramp3d-v4

-2.6% External/SPEC/CINT2000/253.perlbmk/Output/253.perlbmk

-2.4% MultiSource/Applications/SPASS/Output/SPASS

-2.4% External/SPEC/CINT95/134.perl/Output/134.perl

-2.4% MultiSource/Benchmarks/MallocBench/espresso/Output/espresso

-2.3% External/SPEC/CINT2006/471.omnetpp/Output/471.omnetpp

-2.3% External/SPEC/CINT2006/445.gobmk/Output/445.gobmk

-2.3% Total

-2.2% MultiSource/Benchmarks/Prolangs-C/unix-tbl/Output/unix-tbl

-2.2% External/SPEC/CFP2000/179.art/Output/179.art

-2.1% MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/Output/mpeg2decode

-2.1% External/SPEC/CINT2000/175.vpr/Output/175.vpr

-2.1% MultiSource/Benchmarks/Prolangs-C/football/Output/football

-2.0% External/SPEC/CFP2000/177.mesa/Output/177.mesa

-2.0% MultiSource/Benchmarks/VersaBench/dbms/Output/dbms

-1.9% External/SPEC/CINT95/147.vortex/Output/147.vortex

-1.9% External/SPEC/CINT2000/197.parser/Output/197.parser

-1.8% MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/Output/cjpeg

-1.8% External/SPEC/CINT2000/255.vortex/Output/255.vortex

-1.8% External/SPEC/CFP2006/450.soplex/Output/450.soplex

-1.8% MultiSource/Benchmarks/Prolangs-C/unix-smail/Output/unix-smail

-1.7% MultiSource/Applications/hbd/Output/hbd

-1.7% External/SPEC/CINT2000/254.gap/Output/254.gap

-1.6% MultiSource/Benchmarks/Prolangs-C/bison/Output/mybison

-1.5% External/skidmarks10/Output/skidmarks

-1.4% MultiSource/Benchmarks/MallocBench/gs/Output/gs

-1.4% External/SPEC/CINT2006/483.xalancbmk/Output/483.xalancbmk

-1.4% External/SPEC/CFP2006/444.namd/Output/444.namd

-1.4% MultiSource/Applications/ClamAV/Output/clamscan

-1.4% External/SPEC/CINT95/132.ijpeg/Output/132.ijpeg

-1.2% External/SPEC/CFP2006/447.dealII/Output/447.dealII

-1.2% SingleSource/Benchmarks/Adobe-C++/Output/simple_types_loop_invariant

-1.2% MultiSource/Applications/oggenc/Output/oggenc

-1.1% External/SPEC/CINT2006/462.libquantum/Output/462.libquantum

+1.1% MultiSource/Benchmarks/Bullet/Output/bullet

+1.8% MultiSource/Benchmarks/FreeBench/pifft/Output/pifft

+2.2% External/SPEC/CINT2000/252.eon/Output/252.eon

+9.7% MultiSource/Benchmarks/Trimaran/enc-3des/Output/enc-3des

Performance Data

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

    +10.0% SingleSource/Benchmarks/CoyoteBench/huffbench
    +12.0% SingleSource/Benchmarks/McGill/chomp
    +18.0% SingleSource/Benchmarks/BenchmarkGame/n-body
    +45.5% SingleSource/Benchmarks/BenchmarkGame/puzzle
    +10.0% SingleSource/Benchmarks/Shootout/heapsort
    +10.5% MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des
    +10.9% SingleSource/Benchmarks/Shootout-C++/heapsort
    +11.7% MultiSource/Benchmarks/Ptrdist/bc/bc
    +12.0% MultiSource/Benchmarks/McCat/17-bintr/bintr
    +55.2% SingleSource/Benchmarks/Shootout/methcall

Yikes! Do we know why these codes got so much worse? Even 5% is a big
deal on x86.

                                 -Dave

On x86-64, n-body and puzzle have the exact same instructions as with linear scan. The only difference is the choice of registers. This causes some loops to be a few bytes longer or shorter which can easily change performance by that much if that small loop is all the benchmark does.

The greedy allocator is trying to pick registers so inner loops are as small as possible, but that is not always the right thing to do.

Unfortunately, we don't model the effects of code alignment, so there is a lot of luck involved.

I am working my way through the regressions, looking for things the allocator did wrong. Any help is appreciated, please file bugs if you find examples of stupid register allocation.

/jakob

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

Yikes! Do we know why these codes got so much worse? Even 5% is a big
deal on x86.

On x86-64, n-body and puzzle have the exact same instructions as with
linear scan. The only difference is the choice of registers. This
causes some loops to be a few bytes longer or shorter which can easily
change performance by that much if that small loop is all the
benchmark does.

Ok, I can believe that.

The greedy allocator is trying to pick registers so inner loops are as
small as possible, but that is not always the right thing to do.

How does it balance that against spill cost?

Unfortunately, we don't model the effects of code alignment, so there
is a lot of luck involved.

As with any allocator. :)

I am working my way through the regressions, looking for things the
allocator did wrong. Any help is appreciated, please file bugs if you
find examples of stupid register allocation.

Certainly. I would ask that we keep linearscan around, if possible, as
long as there are significant regressions like this. Our customers tend
to really, really care about performance.

                             -Dave

I added the CostPerUse field to the register descriptions. The allocator will try to minimize the spill weight assigned to registers with a CostPerUse. It does this by swapping physical register assignments, and it won't do it if that requires extra spilling.
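A minimal sketch of that swapping idea (illustrative only, not LLVM's implementation): repeatedly swap whole register assignments when that lowers the total weight-times-encoding-cost sum. In this toy each register holds exactly one range, so a swap can never introduce interference or extra spilling.

```python
def rebalance(assignment, weights, cost_per_use):
    """assignment: {live_range: register}. Swap pairs of assignments
    until no swap lowers sum(weight * cost_per_use[reg])."""
    changed = True
    while changed:
        changed = False
        names = list(assignment)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                a, b = names[i], names[j]
                ra, rb = assignment[a], assignment[b]
                before = weights[a] * cost_per_use[ra] + weights[b] * cost_per_use[rb]
                after = weights[a] * cost_per_use[rb] + weights[b] * cost_per_use[ra]
                if after < before:
                    assignment[a], assignment[b] = rb, ra
                    changed = True
    return assignment

# %xmm8-%xmm15 need a REX prefix on x86-64, so model them as cost 1.
regs = rebalance({"hot_loop": "xmm8", "cold": "xmm0"},
                 {"hot_loop": 10.0, "cold": 1.0},
                 {"xmm0": 0, "xmm8": 1})
```

The busy range ends up in the cheap register and the low-weight range takes the REX-prefixed one, which is exactly what goes wrong in the n-body case below when the weights mispredict the dynamic importance of a range.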

This is actually the cause of the n-body regression. The benchmark has nested loops:

  %vreg1 = const pool load
header1:
  ; large blocks with lots of floating point ops
header2:
  ; small loop using %vreg1
  jnz header2
...
  jnz header1

The def of %vreg1 has been hoisted by LICM so it is live across a block with lots of floating point code. The allocator uses the low xmm registers for the large block, and %xmm8 is left for %vreg1, which has a low spill weight. This significantly improves code size, but the small loop suffers.

A low xmm register could be used for %vreg1, but it would need to be rematerialized. The allocator won't go that far just to use cheaper registers.

In this case it might have helped to split the live range and rematerialize, but usually that won't be the case.

/jakob

That's reasonable, and it is also useful to keep it around as a reference when greedy breaks.

On the other hand, I really want to clean up the code surrounding register allocation, and that is much easier to do after linear scan is gone. There is a good chance it won't make it to the 3.0 release.

/jakob

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

The greedy allocator is trying to pick registers so inner loops are as
small as possible, but that is not always the right thing to do.

How does it balance that against spill cost?

I added the CostPerUse field to the register descriptions. The
allocator will try to minimize the spill weight assigned to registers
with a CostPerUse. It does it by swapping physical register
assignments, it won't do it if it requires extra spilling.

CostPerUse models the encoding size of the register?

This is actually the cause of the n-body regression. The benchmark has nested loops:

  %vreg1 = const pool load
header1:
  ; large blocks with lots of floating point ops
header2:
  ; small loop using %vreg1
  jnz header2
...
  jnz header1

The def of %vreg1 has been hoisted by LICM so it is live across a
block with lots of floating point code. The allocator uses the low xmm
registers for the large block, and %xmm8 is left for %vreg1 which has
a low spill weight. This significantly improves code size, but the
small loop suffers.

Why does %xmm8 have a low spill weight? It's used in an inner loop.

In this case it might have helped to split the live range and
rematerialize, but usually that won't be the case.

That was my initial reaction. Splitting should have at least
rematerialized the value just before header2. That should significantly
improve things. This is a classic motivational case for live range
splitting.

Another way to approach this is to add a register pressure heuristic to
LICM so it doesn't spill so much stuff out over such a large loop body.

                                   -Dave

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

The greedy allocator is trying to pick registers so inner loops are as
small as possible, but that is not always the right thing to do.

How does it balance that against spill cost?

I added the CostPerUse field to the register descriptions. The
allocator will try to minimize the spill weight assigned to registers
with a CostPerUse. It does it by swapping physical register
assignments, it won't do it if it requires extra spilling.

CostPerUse models the encoding size of the register?

Yes, something like that.

This is actually the cause of the n-body regression. The benchmark has nested loops:

  %vreg1 = const pool load
header1:
  ; large blocks with lots of floating point ops
header2:
  ; small loop using %vreg1
  jnz header2
...
  jnz header1

The def of %vreg1 has been hoisted by LICM so it is live across a
block with lots of floating point code. The allocator uses the low xmm
registers for the large block, and %xmm8 is left for %vreg1 which has
a low spill weight. This significantly improves code size, but the
small loop suffers.

Why does %xmm8 have a low spill weight? It's used in an inner loop.

Because it is rematerializable and live across a big block where it isn't used.

In this case it might have helped to split the live range and
rematerialize, but usually that won't be the case.

That was my initial reaction. Splitting should have at least
rematerialized the value just before header2. That should significantly
improve things. This is a classic motivational case for live range
splitting.

Well, not really. Note that there are plenty of registers available and no spilling is necessary.

It's just that a REX prefix is required on some instructions when %xmm8 is used. Is it worth it to undo LICM just for that? In this case, probably. In general, no.

/jakob

The new allocator doesn't use VirtRegRewriter. Deleting that alone will improve the average code quality significantly. The worst parts of LiveIntervalAnalysis.cpp can also be deleted.

Once all that code is gone, it is much easier to refactor the LiveIntervals interface.

/jakob

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

That was my initial reaction. Splitting should have at least
rematerialized the value just before header2. That should significantly
improve things. This is a classic motivational case for live range
splitting.

Well, not really. Note that there are plenty of registers available
and no spilling is necessary.

Oh, I misunderstood then. Thanks for clarifying.

It's just that a REX prefix is required on some instructions when
%xmm8 is used. Is it worth it to undo LICM just for that? In this
case, probably. In general, no.

Ah, so you're saying the regression is due to the inner loop icache
footprint increasing. Ok, that makes total sense to me. I agree this
is a difficult thing to get right in a general sort of way. Perhaps the
CostPerUse (or whatever heuristics use it) can factor in the loop body
size so that tight loops are favored for smaller encodings.

                               -Dave

Jakob Stoklund Olesen <stoklund@2pi.dk> writes:

Believe me, I understand the desire to clean this code up. What's
particular about linearscan in this regard? What parts are hard to
clean up due to linearscan?

The new allocator doesn't use VirtRegRewriter. Deleting that alone
will improve the average code quality significantly. The worst parts
of LiveIntervalAnalysis.cpp can also be deleted.

Aha. Yes, those are particularly nasty bits. I totally get the desire
to zap 'em.

Once all that code is gone, it is much easier to refactor the
LiveIntervals interface.

I grok that too.

I have no problem in general with getting rid of linearscan. I just
want to make sure we don't have really terrible regressions before doing
so. A 50% performance hit is a pretty terrible regression. :)

It's easier for me to time an upgrade on our side if I'm reasonably
certain it's not going to cause a bunch of work fixing such regressions.

Several of us are working really hard on our end to catch up to the
latest LLVM release so we can stay more current than we have been. I
want to make sure we don't fall behind again. That benefits everyone
because once we get reasonably close to trunk I can flow patches
upstream much more quickly than I've been able to in the past.

It's easier to stay current when we know we won't get in trouble for
doing so. :)

                              -Dave

It is almost certain that the inner loop doesn't fit in the processor's predecode loop buffer. Modern Intel x86 chips have a buffer that can hold a very small number of instructions and is bounded by instruction count, code size, and sometimes the number of cache lines. If a loop fits in this buffer, the processor can turn off the decoder completely for the loop, a significant power and performance win.

I don't know how realistic it is to model the loop buffer in the register allocator, but this would be a very interesting thing to try to optimize for in a later pass. If an inner loop "almost" fits, then it would probably be worth heroic effort to try to reduce its size to shave off a few bytes.
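A back-of-the-envelope check makes the cliff visible. The capacities below are illustrative placeholders, not the exact figures for any specific core; the point is that the buffer is bounded by several limits at once, so a single REX prefix byte on a few instructions can push a loop over one of them:

```python
def fits_loop_buffer(insn_sizes, max_insns=28, max_bytes=64,
                     line_size=16, max_lines=4):
    """insn_sizes: encoded byte size of each instruction in the loop.
    All four limits are made-up round numbers for illustration."""
    total = sum(insn_sizes)
    lines = (total + line_size - 1) // line_size   # cache lines touched
    return (len(insn_sizes) <= max_insns
            and total <= max_bytes
            and lines <= max_lines)

base = [4] * 15               # 60-byte loop: fits under all three limits
with_rex = [5] * 5 + [4] * 10 # same 15 instructions, 5 gained a REX byte
```

Here `base` fits but `with_rex` (65 bytes) does not, even though the instruction count is unchanged, which is the kind of "almost fits" case where shaving a few bytes would pay off.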

-Chris

It's just that an REX prefix is required on some instructions when
%xmm8 is used. Is it worth it to undo LICM just for that? In this
case, probably. In general, no.

Ah, so you're saying the regression is due to the inner loop icache
footprint increasing. Ok, that makes total sense to me. I agree this
is a difficult thing to get right in a general sort of way. Perhaps the
CostPerUse (or whatwever heuristics use it) can factor in the loop body
size so that tight loops are favored for smaller encodings.

It is almost certain that the inner loop doesn't fit in the processor's predecode loop buffer. Modern Intel x86 chips have a buffer that can hold a very small number of instructions and is bounded by instruction count, code size, and sometimes the number of cache lines. If a loop fits in this buffer, the processor can turn off the decoder completely for the loop, a significant power and performance win.

I don't know how realistic it is to model the loop buffer in the register allocator, but this would be a very interesting thing to try to optimize for in a later pass. If an inner loop "almost" fits, then it would probably be worth heroic effort to try to reduce its size to shave off a few bytes.

Jakob and I have talked about this briefly. The idea is to insert a copy from the larger register to the smaller register in the loop preheader and change the references inside the loop. However, the reality is this is a lot harder than it sounds.

1. When is this profitable? We can model the size of the loop buffer, but this is also dependent on loop alignment. We may have to align small loops on a 32-byte boundary to get this right.

2. When should we do this? I was thinking of a late pass, but it's harder to reason about dependencies once register allocation is done. Perhaps it should be done before the virtual registers are rewritten?

3. How restrictive is the optimization? What if the register is updated inside the loop? We have to be careful about instructions requiring specific register inputs. What if the register is updated and live out of the loop? We would need to copy it back at loop exits.

We should start by tackling the restrictive forms of the problem. That should fix a number of cases where linear scan got lucky. I am sure there are a lot more interesting cases, though.

Evan

Yep, doing this as a late pass, and having it only handle the "easy and obvious" cases would make a lot of sense. I bet that some non-trivial number of the major speedups that Jakob is seeing are due to "greedy getting lucky". The bad thing about this is that relying on luck makes performance analysis extremely difficult. It is much better to actively control these things where possible so that we both get good performance all the time, and have more useful and comparable perf analysis. I know you already know this :-)

-Chris

Chris Lattner <clattner@apple.com> writes:

I don't know how realistic it is to model the loop buffer in the
register allocator, but this would be a very interesting thing to try to
optimize for in a later pass. If an inner loop "almost" fits, then it
would probably be worth heroic effort to try to reduce its size
to shave off a few bytes.

I like this approach as well as it could cover a number of peephole-type
optimizations.

                              -Dave

Evan Cheng <evan.cheng@apple.com> writes:

1. When is this profitable? We can model the size of the loop buffer,
but this is also dependent on loop alignment. We may have to align
small loops on a 32-byte boundary to get this right.

We almost always want to do this anyway on the newest processors.

                          -Dave

I will dare a comment on a topic well over my head, so my answer will probably only reflect my ignorance of the depth of the subject. In any case, I hope to get some clarification of my concepts.

Let me say one more thing before I go into the idea in this post: I've used coroutines a lot in my C++ programming, and certainly when you spend a lot of time working with a hammer, everything starts looking like a nail.

However, it has always seemed to me that exception handling (at least in C++) is just a particular syntax for a subset of coroutine semantics:

void f() { throw 1; }
void h();

void g() {
    try {
        f();
    } catch (...) {
        h();
    }
}

Assuming a hypothetical Coroutine object that supports manual cleanup of the stack-allocated objects, the code above is semantically equivalent to:

struct CoroutineStack {
    Coroutine* context;
    CoroutineStack* next;
};
Coroutine* currentContext;
CoroutineStack* nextCatchContextLink;

void f() { currentContext->yieldTo(nextCatchContextLink->context); }
void h();

void g() {
    Coroutine tryContext([&] { f(); });
    Coroutine catchContext([&] { h(); });
    // Define a new link to store the current catch context,
    // which points to outer catch contexts in this same stack.
    CoroutineStack catchContextLink;
    catchContextLink.context = &catchContext;
    catchContextLink.next = nextCatchContextLink;
    nextCatchContextLink = &catchContextLink;
    tryContext();
}

In case a rethrow happens in the catch context, the code in the catch needs to change slightly:

Coroutine catchContext([&] { h(); nextCatchContextLink = nextCatchContextLink->next; });

I don't know much about DWARF and other EH schemes, and I really can't say what a personality function or a landing pad can do that you can't express with coroutine semantics, so forgive me if the above makes large oversights or is just plain silly.

Does this make sense?

We designed the basic allocator as a baseline and intend to maintain it as a reference allocator. It factors the complexity out of the greedy allocator but otherwise provides the same interface to surrounding code. I don't think it will be long before LinearScan is more broken than GreedyRA, and BasicRA is the safer reference. Maybe Jakob didn't mention this because BasicRA is *temporarily* broken, but in a trivial way.

The regressions occurred in highly unstable benchmarks. It's not as if we were monotonically improving these scores, and GreedyRA suddenly caused a regression. The point of looking into the regressions is to finally take control of this noise now that we have a better framework for doing so.

-Andy

This is certainly one way of implementing exceptions. Unfortunately, it would be prohibitively expensive to create a new coroutine every time a new EH scope is entered — much worse than other explicit scope-tracking implementations like SEH or builtin_sj/lj.

That said, lowering IR from a "conventional" style — where cleanups appear as normal code and control dependencies are expressed directly with normal-ish control flow — to the form required by an SEH-like implementation would indeed require operations not fundamentally dissimilar from coroutine extraction.

John.