Greedy Register Allocation in LLVM 3.0

I just uploaded a blog post outlining the new register allocation algorithm in LLVM 3.0.

  http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html

Please direct comments here.

/jakob

Just a quick question: is greedy still a local allocator (i.e., it only takes the current basic block into consideration) or a global one (it takes the whole function into consideration)?

The greedy allocator is global, but so was the old linear scan allocator.

Cameron

The greedy allocator is global, but so was the old linear scan allocator.

  In http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html
, it says "The algorithm is local, and it cannot clean up messes that
extend beyond a single basic block". Does that refer to the rewriter
algorithm rather than to linear scan?

Regards,
chenwj

Hi Jakob,

Thanks for a very interesting description of the new register allocation algorithm in LLVM!!!

Could you elaborate a bit on the following topics:

1) Do you have any plans to publish something more formal and detailed about this algorithm, e.g. a paper or something similar? It would be nice to better understand how this algorithm relates to well-known algorithms described in the literature, e.g. linear scan, graph coloring by Chaitin and Briggs, iterated register coalescing by George and Appel, and priority-based coloring by Chow and Hennessy, and whether it represents an extension of one of those algorithms or is a totally new algorithm. It would also be nice if the terminology used in the description followed the terminology published in the literature, since using different terminology can be misleading. For example, it is rather well known that LLVM's linear scan register allocator was not a real linear scan allocator, as it allowed backtracking and re-coloring; de facto, it was more of a special case of a graph-coloring register allocator. But because of its name, many published research papers assumed it was a real linear scan implementation and took it as a representative example of such allocators.

2) A comparison of the quality of generated allocations and allocator performance with well-known register allocation algorithms would be very interesting to better understand pros and cons of this new algorithm. Do you have any numbers at hand? If not, do you plan to produce a more thorough analysis?

3) It is stated in the blog that this allocator is much simpler than the old linear scan implementation of LLVM. Can you provide a bit more information?
For example:
- How many lines of code is it, and how does it compare in this regard to the old linear scan?
- Are there any weird places like the former SimpleRegisterCoalescer, which was not that simple at all and had some extremely weird logic :wink:
- You explained that it produces better code than linear scan in most situations, which is cool! But it would also be interesting to know in which pathological cases it still loses against linear scan, and why
- Can you provide a rough estimate of the algorithmic complexity of the new algorithm? Is it O(n), O(n^2) or something else?

Thanks,
Roman

----- Original Message -----

Yes, exactly. The rewriter is local.

/jakob

1) Do you have any plans to publish something more formal and detailed about this algorithm, e.g. a paper or something similar?

I wasn't planning on that, but I might put something together in the long dark California winter nights.

It would be nice to better understand how this algorithm relates to well-known algorithms described in the literature, e.g. linear scan, graph coloring by Chaitin and Briggs, iterated register coalescing by George and Appel, and priority-based coloring by Chow and Hennessy, and whether it represents an extension of one of those algorithms or is a totally new algorithm. It would also be nice if the terminology used in the description followed the terminology published in the literature, since using different terminology can be misleading. For example, it is rather well known that LLVM's linear scan register allocator was not a real linear scan allocator, as it allowed backtracking and re-coloring; de facto, it was more of a special case of a graph-coloring register allocator. But because of its name, many published research papers assumed it was a real linear scan implementation and took it as a representative example of such allocators.

So much for peer review :wink:

Much of the terminology in LLVM's register allocation framework comes from the linear scan literature, as far as I know. Sometimes it doesn't make any sense. For example, our LiveInterval class represents a sorted list of intervals. Each interval is called a LiveRange. I am hoping to clean this stuff up gradually.
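
To make the naming confusion concrete, here is a minimal sketch of how the classes relate. The field names are mine, not the actual LLVM headers:

  #include <vector>

  // A minimal sketch of the naming problem, not the actual LLVM classes:
  // the class called LiveInterval is really a sorted list of segments,
  // and each contiguous segment is (confusingly) called a LiveRange.
  struct LiveRange {                  // one contiguous segment [Start, End)
    unsigned Start, End;
  };

  struct LiveInterval {               // all liveness of one virtual register
    unsigned Reg;                     // the virtual register
    std::vector<LiveRange> Ranges;    // sorted, non-overlapping segments
  };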

Most articles on register allocation start with the premise that register allocation is isomorphic to graph coloring. This is quite wrong in the real world, where we must deal with overlapping register classes and aliasing registers (on x86, for example, AL, AX, and EAX all name overlapping parts of the same register). The terminology is often based on this false premise, and that makes me reluctant to adopt it.

Ideally, I would like to settle on a language that is as close as possible to the established conventions without being misleading.

I don't think you can find any production quality compiler using a 'pure' published register allocator. The devil is in the details, and the published algorithms are too simple.

Anyway, the 'basic' register allocator is quite similar to Chow and Hennessy's priority-based coloring. They separate the unconstrained live ranges first, while basic just drops everything into the priority queue. Unconstrained live ranges only make sense if you assume graph coloring isomorphism.

You could compare the greedy allocator to George and Appel's iterated register coalescing running in reverse. Greedy coalesces live ranges aggressively, and splits them on demand.

2) A comparison of the quality of generated allocations and allocator performance with well-known register allocation algorithms would be very interesting to better understand pros and cons of this new algorithm. Do you have any numbers at hand? If not, do you plan to produce a more thorough analysis?

Unfortunately, it would be too much work to make a fair comparison. As I said, the published algorithms are not good enough for a real compiler. Each would require a lot of work before producing decent results. You would need to work around the missing support for overlapping register classes and register aliases, and you would need to handle pre-colored registers and call-clobbered registers somehow.

The greedy allocator was designed to handle these real-world problems as part of the core algorithm. It was my judgment that adapting something like George and Appel's iterated register coalescing to the real world would be just as much work as starting from scratch. It was my hope that handling the real-world problems from the start would lead to a cleaner design.

GCC's new integrated register allocator was designed by adapting Chaitin-Briggs to the real world, as far as I know. That is the most fair comparison I can think of.

3) It is stated in the blog that this allocator is much simpler than the old linear scan implementation of LLVM. Can you provide a bit more information?
   For example:
    - How many lines of code is it, and how does it compare in this regard to the old linear scan?

A lot of source files are involved, and some are shared, so I don't have these numbers.

The worst of the linear scan code is the rewriter: 2600 lines in VirtRegRewriter.cpp, plus parts of LiveIntervalAnalysis.cpp.

The hard part of greedy is the live range splitting code in SplitKit.cpp at 1400 lines.

The rest is of comparable size, but greedy gets you global live range splitting for the same price.

    - Are there any weird places like the former SimpleRegisterCoalescer, which was not that simple at all and had some extremely weird logic :wink:

Greedy is still using the same coalescer. It has been renamed RegisterCoalescer for honesty. :wink:

Since greedy handles pre-colored registers as part of its algorithm, the coalescer no longer needs to handle physical register coalescing. This code can be removed once linear scan is gone.

I suspect that parts of SplitKit.cpp might make people squint. As the author, I find it perfectly clear, of course.

    - You explained that it produces better code than linear scan in most situations, which is cool! But it would also be interesting to know in which pathological cases it still loses against linear scan, and why

The regressions I have seen have mostly been luck. That is, neither heuristic knows what it is doing, and one gets it just right by accident.

I have a vague sense that linear scan is sometimes doing better with some high register pressure coloring problems. It is hard to quantify, though.

    - Can you provide a rough estimate of the algorithmic complexity of the new algorithm? Is it O(n), O(n^2) or something else?

Something like O(N log M), where N is the number of live ranges and M is the number of contiguous live range segments. I wouldn't be surprised if the worst-case behavior is much worse.

The complexity of global live range splitting depends on the topology of the CFG spanned by the live range, but each split is sub-linear to linear in the number of blocks spanned.

Local live range splitting has a known quadratic behavior in large basic blocks that we still need to eliminate. It rarely shows up in practice.

/jakob

So, does this mean that different BBs may expect the same spilled value to be in different stack slots?

RegAllocGreedy, RegAllocBasic, RegAllocLinearScan and RegAllocPBQP are global.

RegAllocFast is local. It is only used for -O0. In the future, it may get primitive support for the kind of global live ranges that clang -O0 can produce.

/jakob

No, the stack slot is assigned before a live range is spilled, so it uses the same stack slot everywhere. The rewriter is mostly eliminating multiple loads from the same stack slot in a basic block.

But that was the old complicated rewriter. The new rewriter used by greedy is completely trivial. It doesn’t know anything about stack slots. It has only one optimization: Eliminate identity copies.
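
For illustration, here is a minimal sketch of that single optimization. The instruction type and field names are hypothetical, not the actual LLVM rewriter:

  #include <vector>

  // Hypothetical instruction representation, for illustration only.
  struct Instr {
    bool IsCopy;        // is this a register-to-register copy?
    unsigned Dst, Src;  // physical registers after rewriting
  };

  // Once every virtual register has been rewritten to its assigned
  // physical register, a copy whose source and destination ended up
  // as the same register is a no-op and can simply be erased.
  void eliminateIdentityCopies(std::vector<Instr> &Block) {
    std::erase_if(Block, [](const Instr &I) {
      return I.IsCopy && I.Dst == I.Src;
    });
  }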

/jakob

It may be more helpful to explain how LLVM's register allocator came
into existence before debating the high level algorithm.

When I began working on LLVM last October, Jakob was developing an
infrastructure for global live range splitting. It was just becoming
clear that LLVM's Linear Scan was inadequate for this work, and no one
I talked to knew how to fix it. Chris and Evan were very receptive to
replacing the register allocator, but the design space was wide
open. I knew from experience that we did not want to build a
traditional interference graph. Chris, Evan, and Jakob all agreed on
this point. I reasoned that an efficient implementation of
LiveInterval along with a new "LiveIntervalUnion" data structure and
mechanism for caching interference tests would be sufficient, and
offered to build a prototype to prove it.
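
As a rough sketch of the idea (assumed and simplified; the real LiveIntervalUnion uses more specialized data structures), one union per physical register holds the segments of every virtual register assigned to it, so an interference test is an ordered search over sorted segments instead of a lookup in an interference graph:

  #include <iterator>
  #include <map>
  #include <utility>

  // Simplified stand-in for a LiveIntervalUnion: for one physical
  // register, map each assigned segment's start to (end, virtual reg).
  struct LiveIntervalUnion {
    std::map<unsigned, std::pair<unsigned, unsigned>> Segments;

    // Does the candidate segment [Start, End) interfere with anything
    // already assigned to this physical register?
    bool overlaps(unsigned Start, unsigned End) const {
      auto It = Segments.upper_bound(Start);
      // A segment starting at or before Start may extend past it.
      if (It != Segments.begin() && std::prev(It)->second.first > Start)
        return true;
      // Otherwise the next segment may begin before the candidate ends.
      return It != Segments.end() && It->first < End;
    }
  };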

This was an important step. My goal was never to develop a
theoretically superior register allocator algorithm in its own
right. On the contrary, Jakob and I wanted the global splitter to
drive allocation, rather than an allocation algorithm that fell back
to splitting when it failed. The important stakeholders at the time
strongly agreed that the problem Jakob was solving, global live range
splitting, was the difficult problem that deserved attention, and that
the register allocation algorithm was only important in that it did
not inhibit splitting or spilling or unduly impact compile time. I
believe this goal led to a register allocator that is fundamentally
different from anything I'm aware of at the level of software design.

To summarize, Jakob and I both strongly agreed on the following design
goals, primarily for engineering reasons, not theoretical reasons:

- early aggressive coalescing

- no predetermined allocation order
  i.e. no fixed order CFG traversal,
       splitting changes allocation order

- no explicit interference graph

- on-the-fly live range splitting at the finest possible granularity

  When I say that the new allocator is fundamentally different from
  the other allocators I've seen, this is mainly what I'm referring
  to. At each decision point made by the allocator, the splitter may
  rewrite the code, modify live ranges, and incrementally update every
  bit of state in the allocator. There is no abandoning the current
  solution and no iterative application of an algorithm.
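
To make the "no predetermined allocation order" point concrete, here is a much-simplified sketch of the driving loop. The helper functions are assumed primitives, not the actual LLVM code; the point is that splitting pushes new, smaller live ranges back onto the queue, so the allocation order changes as the allocator runs:

  #include <queue>
  #include <vector>

  struct LiveInterval { unsigned Reg; float Weight; };

  struct ByWeight {  // the greedy allocator takes the largest ranges first
    bool operator()(const LiveInterval *A, const LiveInterval *B) const {
      return A->Weight < B->Weight;
    }
  };
  using AllocQueue = std::priority_queue<LiveInterval *,
                                         std::vector<LiveInterval *>, ByWeight>;

  // Assumed allocator primitives, not shown here.
  unsigned tryAssign(LiveInterval &LI);            // 0 if no register is free
  unsigned tryEvict(LiveInterval &LI);             // 0 if nothing evictable
  bool trySplit(LiveInterval &LI, AllocQueue &Q);  // requeues the new pieces
  void assign(LiveInterval &LI, unsigned PhysReg);
  void spill(LiveInterval &LI);

  void allocate(AllocQueue &Queue) {
    while (!Queue.empty()) {
      LiveInterval *LI = Queue.top();
      Queue.pop();
      if (unsigned R = tryAssign(*LI)) { assign(*LI, R); continue; }
      if (unsigned R = tryEvict(*LI))  { assign(*LI, R); continue; }
      if (trySplit(*LI, Queue))        continue;  // pieces come back later
      spill(*LI);                                 // last resort
    }
  }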

When I implemented the Basic/Greedy prototypes, I was able to replace
the code in RegAllocLinearScan.cpp with the code you can still see in
RegAllocBasic.cpp. You can see for yourself what Jakob means by the
new allocator being much simpler. And as Jakob explained, the real
benefit will be in removing VirtRegRewriter.cpp, which is made possible
by his spill code optimizations. Those optimizations are in turn made
practical by the new framework for incrementally updating the register
allocator's state.

The naming decision was literally a five minute brainstorm driven by
the need for unique filenames that are easy to type. "Greedy" is no
better description of our design than "Linear Scan" is of the LLVM
implementation by that name or "Graph Coloring" is of most
Chaitin-Briggs implementations. I believe the best name for the new
design would be one that conveys that it is actually a live range
splitter that happens to perform allocation. But I am not attached to
the names, and always figured the final algorithm, if successful,
would simply be known as the "LLVM Register Allocator".

It's worth mentioning that we considered calling it a "Priority"
allocator, but I objected to this because I felt it to be
misleading. Register reassignment and eviction were part of my
original prototype and I felt those features were more important than
allocation order. I used a priority queue for convenience, but
believed the choice was somewhat insignificant. In fact, the design
was intended to work reasonably well even if priorities were
inverted. I initially used spill weight as a placeholder for whatever
dynamic heuristic Jakob would find works well with the splitter. One
thing Jakob discovered is that an "inverted" heuristic actually
cooperates better with his splitter. The Greedy allocator now
allocates registers to the largest live ranges first.

To be fair to Linear Scan, it is also worth noting that the framework
for live intervals was directly inspired by the LLVM implementation of
the linear scan algorithm.

In the end I believe the effectiveness of any allocator reflects the
work that goes into modeling the details of the target
architecture. This is not easy to achieve in a retargetable code
generator. Consequently, that is where much of the effort continues to
be directed in LLVM, rather than in attempting to rank algorithms in
their most abstract interpretation.

I hope this brief chronicle of the algorithm's inception dispels some of
its mystery.

-Andy

Hi Jakob, Hi Andy,

First of all, thanks a lot for the very detailed and interesting answers!

It may be more helpful to explain how LLVM's register allocator came
into existence before debating the high level algorithm.

When I began working on LLVM last October, Jakob was developing an
infrastructure for global live range splitting. It was just becoming
clear that LLVM's Linear Scan was inadequate for this work, and no one
I talked to knew how to fix it. Chris and Evan were very receptive to
replacing the register allocator, but the design space was wide
open. I knew from experience that we did not want to build a
traditional interference graph. Chris, Evan, and Jakob all agreed on
this point. I reasoned that an efficient implementation of
LiveInterval along with a new "LiveIntervalUnion" data structure and
mechanism for caching interference tests would be sufficient, and
offered to build a prototype to prove it.

This was an important step. My goal was never to develop a
theoretically superior register allocator algorithm in its own
right. On the contrary, Jakob and I wanted the global splitter to
drive allocation, rather than an allocation algorithm that fell back
to splitting when it failed. The important stakeholders at the time
strongly agreed that the problem Jakob was solving, global live range
splitting, was the difficult problem that deserved attention, and that
the register allocation algorithm was only important in that it did
not inhibit splitting or spilling or unduly impact compile time. I
believe this goal led to a register allocator that is fundamentally
different from anything I'm aware of at the level of software design.

To summarize, Jakob and I both strongly agreed on the following design
goals, primarily for engineering reasons, not theoretical reasons:

I totally understand that the goal was/is to build a real, efficient register allocator instead of producing a theoretical result that may sound good on paper but would not be very practical, in terms of quality or compilation speed, on real-world use cases.

- early aggressive coalescing

- no predetermined allocation order
  i.e. no fixed order CFG traversal,
       splitting changes allocation order

- no explicit interference graph

- on-the-fly live range splitting at the finest possible granularity

This very nicely summarizes main objectives of the new allocator.

When I say that the new allocator is fundamentally different from
the other allocators I've seen, this is mainly what I'm referring
to. At each decision point made by the allocator, the splitter may
rewrite the code, modify live ranges, and incrementally update every
bit of state in the allocator. There is no abandoning the current
solution and no iterative application of an algorithm.

While I agree that most "classical" allocators (e.g. graph coloring) do not do fine-grained splitting and usually build an explicit interference graph, I also want to mention that many of the modern proposals published in the literature often do very aggressive splitting (sometimes even before and after each instruction, or at the beginning and end of each basic block). Many of the proposals also avoid constructing an interference graph and replace it with on-demand interference computation, which is often very cheap and efficient, e.g. for SSA programs. There are also descriptions of how to compute liveness information very efficiently. Quite a few of these register allocation proposals are also able to handle overlapping register classes. A few names to mention in these areas are, for example, Hack, Bouchez, Boissinot, Pereira, Wimmer, etc.

When I implemented the Basic/Greedy prototypes, I was able to replace
the code in RegAllocLinearScan.cpp with the code you can still see in
RegAllocBasic.cpp. You can see for yourself what Jakob means by the
new allocator being much simpler. And as Jakob explained, the real
benefit will be in removing VirtRegRewriter.cpp, which is made possible
by his spill code optimizations. Those optimizations are in turn made
practical by the new framework for incrementally updating the register
allocator's state.

The naming decision was literally a five minute brainstorm driven by
the need for unique filenames that are easy to type. "Greedy" is no
better description of our design than "Linear Scan" is of the LLVM
implementation by that name or "Graph Coloring" is of most
Chaitin-Briggs implementations. I believe the best name for the new
design would be one that conveys that it is actually a live range
splitter that happens to perform allocation. But I am not attached to
the names, and always figured the final algorithm, if successful,
would simply be known as the "LLVM Register Allocator".

It is true that names do not always reflect the essence. But on the other hand, there is a lot of ongoing research on register allocation (and compilers in general), and it looks like more and more such efforts choose LLVM as a platform for experimentation. Quite a few results and comparisons are published. So, it would be nice to have a somewhat concrete (if not ideal) name for the LLVM allocator rather than "LLVM register allocator". In a few years, there may be another, even better register allocator for LLVM. If it is called the "LLVM register allocator" again, that would lead to a lot of misunderstanding when reading descriptions and papers. The typical question then will be "Which LLVM register allocator was meant?" :wink:

It's worth mentioning that we considered calling it a "Priority"
allocator, but I objected to this because I felt it to be
misleading. Register reassignment and eviction were part of my
original prototype and I felt those features were more important than
allocation order. I used a priority queue for convenience, but
believed the choice was somewhat insignificant. In fact, the design
was intended to work reasonably well even if priorities were
inverted. I initially used spill weight as a placeholder for whatever
dynamic heuristic Jakob would find works well with the splitter. One
thing Jakob discovered is that an "inverted" heuristic actually
cooperates better with his splitter. The Greedy allocator now
allocates registers to the largest live ranges first.

To be fair to Linear Scan, it is also worth noting that the framework
for live intervals was directly inspired by the LLVM implementation of
the linear scan algorithm.

In the end I believe the effectiveness of any allocator reflects the
work that goes into modeling the details of the target
architecture. This is not easy to achieve in a retargetable code
generator. Consequently, that is where much of the effort continues to
be directed in LLVM, rather than in attempting to rank algorithms in
their most abstract interpretation.

Absolutely. And yet a detailed description, deep explanation, and comparison (either theoretical or practical, based on benchmarks) is always useful, as it introduces more clarity and helps to avoid misunderstandings.
The aim is not to produce a paper of "academic" quality, or something comparable, just for the sake of getting it published at a conference. There are enough people in academia who would do that :wink:
The reason should be that it makes it easier to further develop and improve the LLVM register allocator, and may also lead to better register allocation algorithms overall. And there is also a lot to gain from developments outside of LLVM. Once register allocation people (both from academia and from industry) have understood how this algorithm works and why certain (practical) decisions were made, they may realize how to improve it even further by borrowing approaches and tricks invented in the scope of other frameworks and register allocation algorithms.

I hope this brief chronicle of the algorithm's inception dispels some of
its mystery.

Yes. It was very educational! Actually, it would be nice to have a page with a detailed description of the current (and maybe the old) register allocator as part of the LLVM website. For starters, the blog entry from Jakob and the explanations from this mail thread could be put there. Later it could be extended and improved. What do you think about this idea?

Thanks again,
Roman

I suppose you could call it the Tricky Jakob allocator. :slight_smile:

John.

That’s interesting. Do you have any references?

/jakob

Thanks for your interest. You make several good points and give proper credit to academics who are solving the problem in a more rigorous fashion.

I’ll let Jakob decide how to proceed with the documentation. As I said, the heavy lifting was in the splitter/spiller design, and that could use some explanation.

-Andy

Hi Jakob,

Yes, I have references. For example, off the top of my head I would name the following papers:

  1. A Generalized Algorithm for Graph-Coloring Register Allocation by Michael D. Smith, Norman Ramsey and Glenn Holloway
    http://www.cs.tufts.edu/~nr/pubs/gcra-abstract.html

  2. Register allocation by puzzle solving by Fernando Magno Quintão Pereira, Jens Palsberg
    http://llvm.org/pubs/2008-06-PLDI-PuzzleSolving.pdf

I think I’ve seen a few more papers on this topic, but cannot remember them at the moment. If I find more papers in my collection I’ll let you know.

/Roman

Thanks! The introduction to the first paper nicely explains the problem with the traditional algorithms.

/jakob

Hi Jakob,
Yes, I have references. For example, off the top of my head I would name
the following papers:
1) A Generalized Algorithm for Graph-Coloring Register Allocation
by Michael D. Smith, Norman Ramsey and Glenn Holloway
http://www.cs.tufts.edu/~nr/pubs/gcra-abstract.html
2) Register allocation by puzzle solving by Fernando Magno Quintão Pereira,
Jens Palsberg
http://llvm.org/pubs/2008-06-PLDI-PuzzleSolving.pdf

Fernando was an intern of mine at Google for the summer (now he's a
professor in Brazil :P), and has LLVM code for his allocator.
It was quite nice, from what I remember, and quite practical. I doubt
the code still works, but he was always interested in seeing it taken
over and used for real.
(He definitely understood the difference between "research" and "reality").

As Leo says, the "world of register allocation" has definitely caught
up to the realities of SSA and newer compilers in the past 5-6 years.

I propose "the LLVM 3.0 register allocator". We have no plans to throw away the greedy allocator in 3.1, but we'll probably make major enhancements to it, which will make it incomparable to 3.0. If it does eventually get thrown away and replaced, then a version number can capture that.

-Chris

That makes a lot of sense to me. None of the algorithm details are sacred, and there is a good chance something essential will have changed by 3.1 and 3.2.

Meanwhile, I am keeping the RegAllocGreedy.cpp file name and the llc -regalloc=greedy command line option.

/jakob