Regalloc Refactoring

Chris Lattner wrote:

I'm definitely interested in improving coalescing and it sounds like
this would fall under that work. Do you have references to papers
that talk about the various algorithms?

Some suggestions:

@InProceedings{Budimlic02,
  AUTHOR = {Zoran Budimlic and Keith D. Cooper and Timothy J. Harvey
and Ken Kennedy and Timothy S. Oberg and Steven W. Reeves},
  YEAR = "2002",
  TITLE = "Fast copy coalescing and live-range identification",

Yep, this is the one I was thinking of. It is available online here:
http://www.cs.rice.edu/~keith/LACSI/pldi02.pdf

I was just looking at this today. One thing that strikes me about
all these papers I've read on the topic is that no one seems to
consider the interaction of coalescing with spilling. By definition
coalescing increases lifetimes and can cause more interferences.
Yet the papers all try to coalesce as many copies as possible.

On, say, the x86 machines, the cost of a spill is really not much
different from the cost of a copy so it's probably close to a wash in
the end. But there are many machines where the cost of each operation
is vastly different. Aggressive coalescing is not always good. Often
you'd rather copy than spill.

Is anyone aware of publications addressing the interplay among
coalescing, live range splitting, register allocation and spilling?

This is one of the reasons I want to separate all of these concerns.
We shouldn't force developers to always coalesce or always use some
generic measure of spill cost.

                                -Dave

Chris Lattner wrote:

As I work toward improving LLVM register allocation, I've
come across the need to do some refactoring.

Yay!

Beyond that, one of the issues is the "r2rmap" and "rep" function. As
you've noticed, the coalescer basically uses these to avoid rewriting the
code after it does coalescing. For example, if r1024 is coalesced with
r1026, it leaves all uses of both registers in the code, instead of
rewriting uses of r1026 with r1024 and deleting all memory of r1026. This
made sense long ago, but now it is increasingly clear that this is a bad
idea. I would much rather have the coalescer be responsible for updating
the code. I would suggest doing this as a first step; it is clearly
goodness. :)
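To make this concrete, here is a minimal, hypothetical sketch of the rep()/r2rmap idea and the one-shot rewrite described above. The names and data layout are illustrative only, not LLVM's actual implementation:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: a union-find over virtual register numbers.
// Coalescing r1026 into r1024 records a mapping; rep() chases the chain
// (with path compression) to find the representative register.
struct RegRepMap {
    std::unordered_map<unsigned, unsigned> r2rMap; // reg -> coalesced-into reg

    unsigned rep(unsigned Reg) {
        auto It = r2rMap.find(Reg);
        if (It == r2rMap.end())
            return Reg;
        // Path compression: point directly at the final representative.
        unsigned Root = rep(It->second);
        It->second = Root;
        return Root;
    }

    void coalesce(unsigned Src, unsigned Dst) { r2rMap[rep(Src)] = rep(Dst); }
};

// The "rewrite the code" step: replace every register use with its
// representative, after which the map can simply be discarded.
void rewriteUses(std::vector<unsigned> &Operands, RegRepMap &Map) {
    for (unsigned &Reg : Operands)
        Reg = Map.rep(Reg);
}
```

Once the coalescer itself performs this rewrite, downstream passes never see r2rmap at all.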

This is what I was trying to get at, but wasn't entirely clear about.
Does the loop nest after the call to joinIntervals in
LiveIntervals::runOnMachineFunction do this rewrite? Specifically:

   // perform a final pass over the instructions and compute spill
   // weights, coalesce virtual registers and remove identity moves.
   const LoopInfo &loopInfo = getAnalysis<LoopInfo>();

   for (MachineFunction::iterator mbbi = mf_->begin(), mbbe = mf_->end();
        mbbi != mbbe; ++mbbi) {
     [...]

       for (unsigned i = 0, e = mii->getNumOperands(); i != e; ++i) {
         const MachineOperand &mop = mii->getOperand(i);
         if (mop.isRegister() && mop.getReg() &&
             MRegisterInfo::isVirtualRegister(mop.getReg())) {
           // replace register with representative register
           unsigned reg = rep(mop.getReg());
           mii->getOperand(i).setReg(reg);

Doesn't that last statement actually do the rewrite? And how does
this interact with the spill cost computation? I'm thinking separating
out the spill cost computation is going to be a bit more work because
right now it does some of its calculations in concert with this
rewriting. But it seems workable.

Right. r2rmap is dead after coalescing. Chris's suggestion (the coalescer owns the rewrite) makes sense.

For the same reasons, I would like to eventually separate
the spill cost calculation from the coalescing phase and

This seems fine in principle, but I'd like Evan to chime in when he's
back.

Sure.

While I agree spill cost computation does not belong in coalescer, I am not sure if it should go into a separate pass. To me, spill cost formulas should be register allocator specific. Perhaps it ought to belong to a generic register allocator class. Each derivative allocator is responsible for overriding it and calling it if it deems necessary.

I am not too concerned about the potential increase in compile time, but please measure it nevertheless. :) One of these days we will provide a reg -> uses mapping to help mitigate the cost.

Evan

Chris Lattner wrote:

I'm definitely interested in improving coalescing and it sounds like
this would fall under that work. Do you have references to papers
that talk about the various algorithms?

Some suggestions:

@InProceedings{Budimlic02,
  AUTHOR = {Zoran Budimlic and Keith D. Cooper and Timothy J. Harvey
and Ken Kennedy and Timothy S. Oberg and Steven W. Reeves},
  YEAR = "2002",
  TITLE = "Fast copy coalescing and live-range identification",

Yep, this is the one I was thinking of. It is available online here:
http://www.cs.rice.edu/~keith/LACSI/pldi02.pdf

I was just looking at this today. One thing that strikes me about
all these papers I've read on the topic is that no one seems to
consider the interaction of coalescing with spilling. By definition
coalescing increases lifetimes and can cause more interferences.
Yet the papers all try to coalesce as many copies as possible.

On, say, the x86 machines, the cost of a spill is really not much
different from the cost of a copy so it's probably close to a wash in
the end. But there are many machines where the cost of each operation
is vastly different. Aggressive coalescing is not always good. Often
you'd rather copy than spill.

Is anyone aware of publications addressing the interplay among
coalescing, live range splitting, register allocation and spilling?

This is one of the reasons I want to separate all of these concerns.
We shouldn't force developers to always coalesce or always use some
generic measure of spill cost.

These are "implementation details". They don't necessarily make good paper material. :)

Obviously, smart heuristics can make a big difference here (estimated register pressures, etc.) But the more important thing is how the passes downstream can recover from the earlier mistakes. By this, we mean live range splitting and re-materialization. Unfortunately, neither of these is feasible given the current infrastructure. Chris and I have been discussing a rewrite informally, but no formal plan has materialized.

Evan

Yep, this is the one I was thinking of. It is available online here:
http://www.cs.rice.edu/~keith/LACSI/pldi02.pdf

I was just looking at this today. One thing that strikes me about
all these papers I've read on the topic is that no one seems to
consider the interaction of coalescing with spilling. By definition
coalescing increases lifetimes and can cause more interferences.
Yet the papers all try to coalesce as many copies as possible.

Funny you should mention it; we've been (okay, Evan has been ;)) fighting this issue lately.

On, say, the x86 machines, the cost of a spill is really not much
different from the cost of a copy so it's probably close to a wash in
the end. But there are many machines where the cost of each operation
is vastly different. Aggressive coalescing is not always good. Often
you'd rather copy than spill.

Even on X86, loads are more expensive than copies (at least on sane implementations).

Is anyone aware of publications addressing the interplay among
coalescing, live range splitting, register allocation and spilling?

This is one of the reasons I want to separate all of these concerns.
We shouldn't force developers to always coalesce or always use some
generic measure of spill cost.

In my mind (a dangerous place to go), it's not about coalescing vs. spilling, it's about splitting vs. spilling.

In particular, coalescing does increase live ranges of values, but you can have the same increase from code written without copies. Basically, not doing aggressive copy coalescing means that you are dependent on the user telling you where live ranges should be split. In an aggressive compiler, the user doesn't actually have control, as most of the explicit copies are gone by the time regalloc happens. This means that you have an arbitrary set of copies that come from things like 2-addr elim, phi elim, extensions that don't generate code, etc.

Instead of doing this, I think the right design is to:

1. Coalesce vregs together as aggressively as possible. Build huge live ranges, delete as many copies as possible.
2. Start priority based register allocation.
3. For each live range without a register available, consider:
    1. live range splitting
    2. remat
    3. spilling

In particular, by doing live range splitting, the *register allocator* can control where the copies go, rather than relying on luck that previous passes introduce copies in convenient places.
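The three-step design above can be sketched roughly as follows. This is a toy model under assumed names (LiveRange, Fallback, allocate) with a deliberately naive interference test, not LLVM's allocator:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of the design above: allocate live ranges in
// priority order; when no register is free, fall back to split / remat /
// spill. All names and heuristics here are illustrative.
struct LiveRange {
    unsigned Start, End;   // instruction indices
    float Weight;          // allocation priority (e.g. spill weight)
    bool Rematerializable;
    int AssignedReg = -1;  // -1 = not yet assigned
};

enum class Fallback { Split, Remat, Spill };

bool interferes(const LiveRange &A, const LiveRange &B) {
    return A.Start < B.End && B.Start < A.End;
}

// Returns the fallback action chosen for each range that got no register.
std::vector<Fallback> allocate(std::vector<LiveRange> &Ranges, unsigned NumRegs) {
    // Step 2: priority order -- highest weight first.
    std::vector<LiveRange *> Order;
    for (auto &R : Ranges) Order.push_back(&R);
    std::sort(Order.begin(), Order.end(),
              [](auto *A, auto *B) { return A->Weight > B->Weight; });

    std::vector<Fallback> Actions;
    for (LiveRange *R : Order) {
        // Try each physical register; take the first with no interference.
        for (unsigned Reg = 0; Reg < NumRegs && R->AssignedReg < 0; ++Reg) {
            bool Free = true;
            for (auto &Other : Ranges)
                if (Other.AssignedReg == (int)Reg && interferes(*R, Other))
                    Free = false;
            if (Free) R->AssignedReg = Reg;
        }
        // Step 3: no register available -- consider split, remat, spill.
        if (R->AssignedReg < 0) {
            if (R->End - R->Start > 10)      // long range: splitting may help
                Actions.push_back(Fallback::Split);
            else if (R->Rematerializable)    // cheap to recompute at uses
                Actions.push_back(Fallback::Remat);
            else
                Actions.push_back(Fallback::Spill);
        }
    }
    return Actions;
}
```

The key design point is that the split/remat/spill decision happens inside the allocation loop, where the allocator knows exactly which ranges conflict, rather than being baked in by earlier passes.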

In practice, right now we have no splitting capability. In addition, we coalesce physregs with vregs, which can end up pinning a physreg across an entire function even if it has few uses. This is all very bad; Evan is working on making the coalescer more careful about coalescing vregs with pregs now.

In the next 6 months or so, I'd like to start investigating a bigger refactoring of linear scan, which will hopefully allow us to build in some of the smarter techniques like aggressive remat, splitting, etc. The first step to getting that is to make sure the data structures (e.g. live intervals) are sane.

-Chris

I think in the paper I mentioned above (“Design and Implementation of GCRA in GCC”; though the implementation ultimately failed, the paper itself contains a lot of useful information) there are descriptions of different coalescing techniques: iterated, optimistic and extended coalescing. You can find the papers on these methods at CiteSeer, for example. Here are the links:
http://citeseer.ist.psu.edu/park98optimistic.html for optimistic and
http://citeseer.ist.psu.edu/197136.html for iterated coalescing.

The main idea is to try to split live ranges of the coalesced virtual regs that haven’t got a color during the select step and to color them separately. I think this is exactly what Chris suggests (with rematerialization, too, of course).

Chris Lattner wrote:
>>> I'm definitely interested in improving coalescing and it sounds like
>>> this would fall under that work. Do you have references to papers
>>> that talk about the various algorithms?
>> Some suggestions:
>>
>> @InProceedings{Budimlic02,
>> AUTHOR = {Zoran Budimlic and Keith D. Cooper and Timothy J. Harvey
>> and Ken Kennedy and Timothy S. Oberg and Steven W. Reeves},
>> YEAR = "2002",
>> TITLE = "Fast copy coalescing and live-range identification",
>
> Yep, this is the one I was thinking of. It is available online here:
> http://www.cs.rice.edu/~keith/LACSI/pldi02.pdf

I was just looking at this today. One thing that strikes me about
all these papers I've read on the topic is that no one seems to
consider the interaction of coalescing with spilling. By definition
coalescing increases lifetimes and can cause more interferences.
Yet the papers all try to coalesce as many copies as possible.

Most papers I read these days try to coalesce anything that won't
cause spills. Some of these approaches coalesce everything,
then undo as necessary to avoid spills, etc.

But certainly, I don't believe recent papers in the area of coalescing
ignore spill costs.

Is anyone aware of publications addressing the interplay among
coalescing, live range splitting, register allocation and spilling?

Yes, they are around; I'll dig them out of my PDFs :)

Evan Cheng wrote:

Obviously, smart heuristics can make a big difference here (estimated register pressures, etc.) But the more important thing is how the passes downstream can recover from the earlier mistakes. By this, we mean live range splitting and re-materialization.

Can you explain this a little more? Do you mean that neither live
range splitting nor rematerialization are practically possible
with LLVM right now? If that's the case, what are the primary issues?
These are things I will definitely have to do here so I'm quite
interested in how this will all work. I'm more than happy to do
a bunch of heavy lifting getting things into shape.

                                -Dave

Chris Lattner wrote:

1. Coalesce vregs together as aggressively as possible. Build huge live ranges, delete as many copies as possible.
2. Start priority based register allocation.
3. For each live range without a register available, consider:
    1. live range splitting
    2. remat
    3. spilling

In particular, by doing live range splitting, the *register allocator* can control where the copies go, rather than relying on luck that previous passes introduce copies in convenient places.

Yes, I think that's a good approach. I was thinking along the same
general lines.

In practice, right now we have no splitting capability. In addition, we coalesce physregs with vregs, which can end up pinning a physreg across an entire function even if it has few uses. This is all very bad; Evan is working on making the coalescer more careful about coalescing vregs with pregs now.

This gets into a discussion I had with Preston Briggs once. He suggests
doing "pre-coloring" (putting virtuals into physicals for needed
operations like shift on X86) and 2-address transformations within the
register allocator itself. I'll have to dig up my notes but he
convinced me it was the right way to go. A previous compiler I worked
on did the 2-address transformation just like LLVM and he pointed out
some weaknesses in that to me.

That's another avenue I'd like to explore in my copious spare time. :-/

In the next 6 months or so, I'd like to start investigating a bigger refactoring of linear scan, which will hopefully allow us to build in some of the smarter techniques like aggressive remat, splitting, etc. The first step to getting that is to make sure the data structures (e.g. live intervals) are sane.

Great! Hopefully I'll gain a bunch of experience with regalloc by
then and can contribute some insights.

                                     -Dave

David Greene wrote:

Evan Cheng wrote:

Obviously, smart heuristics can make a big difference here (estimated register pressures, etc.) But the more important thing is how the passes downstream can recover from the earlier mistakes. By this, we mean live range splitting and re-materialization.

Can you explain this a little more? Do you mean that neither live

Oops, copied the wrong part of the paragraph. I meant to respond to this:

> Unfortunately, neither of these is feasible given the current
> infrastructure.

                                -Dave

Evan Cheng wrote:

While I agree spill cost computation does not belong in coalescer, I am not sure if it should go into a separate pass. To me, spill cost formulas should be register allocator specific. Perhaps it ought to belong to a generic register allocator class. Each derivative allocator is responsible for overriding it and calling it if it deems necessary.

I'm actually fairly nervous about using inheritance this way. I know
it's done like this all over LLVM but the Template Method pattern is
often better. You don't _really_ need dynamic polymorphism for things
like this. Static polymorphism is sufficient and it has the advantage
of more flexibility for reuse (doesn't depend on a specific base class).

For example, if I'm writing a register allocator, one way to do what
you're saying without inheritance is to parameterize the allocator
with a traits class and call into that for the custom routines:

template<class RegallocTraits>
class GCRegAlloc {
    void doAlloc() {
        ...
        RegallocTraits::computeSpillCost(someValue);
        ...
    }
};

An alternative to think about during these discussions.
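To illustrate the reuse argument, here is a minimal compilable sketch (all class and function names are made up for this example) of two different allocators sharing one spill-cost traits class:

```cpp
#include <cassert>
#include <vector>

// Hypothetical traits class: two different allocators share one
// spill-cost policy without needing a common base class.
struct UseCountSpillCost {
    // Cost grows with the number of uses of the value.
    static float computeSpillCost(const std::vector<unsigned> &UseSites) {
        return 1.0f * UseSites.size();
    }
};

template <class RegallocTraits>
struct LinearScanAlloc {
    float costOf(const std::vector<unsigned> &Uses) {
        return RegallocTraits::computeSpillCost(Uses);
    }
};

template <class RegallocTraits>
struct GraphColoringAlloc {
    float costOf(const std::vector<unsigned> &Uses) {
        return RegallocTraits::computeSpillCost(Uses);
    }
};
```

Both allocators instantiate with `UseCountSpillCost`; swapping in a better spill-weight algorithm later means writing one new traits class, without touching either allocator.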

                              -Dave

Evan Cheng wrote:

Obviously, smart heuristics can make a big difference here (estimated
register pressures, etc.) But the more important thing is how the
passes downstream can recover from the earlier mistakes. By this, we
mean live range splitting and re-materialization.

Can you explain this a little more? Do you mean that neither live
range splitting nor rematerialization are practically possible
with LLVM right now?

Anything is possible. :)

In practice, there are good ways and bad ways to build things. For example, Evan recently implemented a really simple form of remat that works on instructions that have no register inputs (e.g. things like "load immediate").

That said, we want to implement the full generality of remat and splitting and have them work together well in the same framework. The key to this is getting the right data structure.
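A minimal sketch of that "simple remat" test, assuming a toy instruction representation (the names here are made up, not LLVM's):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the simple remat check described above: an
// instruction is trivially rematerializable if it reads no registers and
// has no side effects -- e.g. a load-immediate.
struct Instr {
    std::vector<unsigned> RegInputs; // registers the instruction reads
    bool HasSideEffects;             // memory writes, calls, flags, ...
};

bool isTriviallyRematerializable(const Instr &I) {
    return I.RegInputs.empty() && !I.HasSideEffects;
}
```

Such a value can simply be recomputed at each use point instead of being spilled and reloaded.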

If that's the case, what are the primary issues?

Evan is the best to respond to this at this point :)

These are things I will definitely have to do here so I'm quite
interested in how this will all work. I'm more than happy to do
a bunch of heavy lifting getting things into shape.

Great!

-Chris

Evan Cheng wrote:

Obviously, smart heuristics can make a big difference here (estimated
register pressures, etc.) But the more important thing is how the
passes downstream can recover from the earlier mistakes. By this, we
mean live range splitting and re-materialization.

Can you explain this a little more? Do you mean that neither live
range splitting nor rematerialization are practically possible
with LLVM right now?

Anything is possible. :)

In practice, there are good ways and bad ways to build things. For
example, Evan recently implemented a really simple form of remat that
works on instructions that have no register inputs (e.g. things like
"load immediate").

That said, we want to implement the full generality of remat and splitting
and have them work together well in the same framework. The key to this
is getting the right data structure.

If that's the case, what are the primary issues?

Evan is the best to respond to this at this point :)

There are a lot of implementation details, many of which I don't have a clear understanding of yet. Take splitting, for example. The allocator needs to pick a point and recompute conflicts. Can we do that with the given infrastructure? Perhaps. But not without pain. The current data structure lacks detailed def-use info to properly determine the correct split point. The register allocator even treats "fixed" (i.e. physical register) intervals separately from other active ones.

The point is, the current code needs a lot of massaging. If you are interested (after your first rounds of refactoring), please consider a replacement for LiveInterval that keeps more accurate info that can be plugged into the same existing allocator. The next step is to do away with the r2rmap and rep().

Thanks,

Evan

That's all implementation detail. My main point is spill weight computation is tied to the particular allocator / spiller combo so it should not be a separate pass.

Evan

This is a very detailed design point. I don't think it makes sense to talk about this until we are further along :). I agree that there are pros and cons, but I see these (static specialization vs. dynamic dispatch) as two equivalent ways to solve the same problem, just with two different sets of trade-offs. With one you have code duplication (of an entire register allocator??); with the other you have dynamic dispatch overhead.

In practice, the only way to determine the worth of one approach over the other is careful measurement, which you can't do until you have an implementation :)

-Chris

While implementing Wimmer's linear scan, which is heavily based on
splitting of live intervals, I already extended LiveInterval to contain
more detailed info about all def-use occurrences. I also implemented
the functionality for splitting a live interval at any place.
Additionally, the logic for selecting an "optimal" splitting position
is done according to Wimmer's paper.

Classes that I changed were: LiveInterval, LiveIntervalAnalysis,
RegAllocLinearScan. To avoid conflicts with recent changes by Evan, I
did the modifications on my copies of these classes which I named
WimmerLiveInterval, WimmerLiveIntervalAnalysis,
WimmerRegAllocLinearScan.

I'm currently working on documenting the code and adding comments as
well as some refactoring to make it cleaner. I plan to contribute the
code rather soon, hopefully in May. In case of interest, I could post
the code for my WimmerLiveInterval and WimmerLiveIntervalAnalysis
classes even earlier.

Overall, this register allocator is rather functional already, at least
for X86.

The remaining issues with the allocator are:
- Better coalescing decisions (here some insights from Fernando's paper
and maybe some others could help)

- Reducing the number of moves between split parts of live
intervals when they are allocated to different physical registers.

- Use of spill weight information and other heuristics for selecting
a live interval to spill - currently this is not taken into account
at all, as in Wimmer's paper.

- Handling of rematerialization

- Better handling of memory folding - Wimmer's linear scan does it
at register allocation time and only conditionally, not in
LiveIntervalAnalysis as is done now. It distinguishes between SHOULD
and MUST occurrences of a register in instructions. But the current LLVM
infrastructure for memory folding does it unconditionally. This
probably requires some minor changes, e.g. providing separate
functions for asking whether folding is potentially possible and for
actually doing the folding later, if the decision to do so is taken.
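The suggested split into a query function and an action function might look roughly like this. All names here are hypothetical, not LLVM's actual folding API:

```cpp
#include <cassert>
#include <set>

// Hypothetical two-function interface: a side-effect-free query ("is
// folding potentially possible?") separate from the call that actually
// performs the fold, so the allocator can decide conditionally.
struct MemFoldInfo {
    std::set<unsigned> FoldableOpcodes; // opcodes that have a memory form

    // Query only: safe to call while merely exploring alternatives.
    bool canFoldMemoryOperand(unsigned Opcode) const {
        return FoldableOpcodes.count(Opcode) != 0;
    }

    // Action: performs the fold iff the query succeeds.
    bool foldMemoryOperand(unsigned Opcode) {
        if (!canFoldMemoryOperand(Opcode))
            return false;
        ++NumFolded; // stands in for actually rewriting the instruction
        return true;
    }

    unsigned NumFolded = 0;
};
```

With this shape, the allocator can ask the question cheaply for SHOULD occurrences and commit only when the fold is genuinely profitable.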

-Roman

Evan Cheng wrote:

the given infrastructure? Perhaps. But not without pain. The current data structure lacks detailed def-use info to properly determine the correct split point. The register allocator even treats "fixed" (i.e. physical register) intervals separately from other active ones.

The point is, the current code needs a lot of massaging. If you are interested (after your first rounds of refactoring), please consider a replacement for LiveInterval that keeps more accurate info that can be plugged into the same existing allocator. The next step is to do away with the r2rmap and rep().

This gets back to my questions about what LiveIntervals lacks.

- What "data structure" are you referring to above?

- What specific def-use info is needed that we don't have access
   to currently?

- A couple of weeks ago I asked about the comment in
   LiveIntervalAnalysis referring to "conservative" computation.
   What's conservative about it? What information does it lack that
   a traditional live variable/live range analysis provides?

   The answers I got back said essentially that it's not conservative
   at all and can't be improved upon. I'm getting a different sense
   from you so I'd like to understand this better.

                                  -Dave

Chris Lattner wrote:

This is a very detailed design point. I don't think it makes sense
to talk about this until we are further along :).

It's actually a fundamental design decision and therefore needs to
be talked about up front. I understand where you're coming from
but static polymorphism is the design I'd choose. It's important to
get the sense of the community about this early on so I don't go do
something that people object to.

I agree that there are pros and cons, but I see these (static
specialization vs. dynamic dispatch) as two equivalent ways to solve
the same problem, just with two different sets of trade-offs. With
one you have code duplication (of an entire register allocator??);
with the other you have dynamic dispatch overhead.

This isn't static specialization. It's the Template Method pattern.
There's one generic algorithm that's parameterized through one or
more traits classes. The generic algorithm calls into the traits
classes to get varying behavior.

There is no code duplication. Let's say I write two register
allocators. One uses linear scan and the other uses graph coloring.
Both need to compute spill weights. A single traits class can
be reused as a parameter to both allocators, allowing the spill
weight computation code to be shared. If someone else comes
along later with a better spill weight algorithm, we can substitute
that traits class when we parameterize the allocators.

An example is the use of allocators in the standard library. The
code for std::vector isn't duplicated. If I want a different
allocation strategy I write a new allocator class and pass that
as part of the vector type.

In practice, the only way to determine the worth of one approach over
the other is careful measurement, which you can't do until you have an
implementation :)

The detailed part of the design is exactly what goes into the
traits class(es), and yes, that is something that will be hashed
out with experience. But choosing static vs. dynamic polymorphism
has important software engineering implications. It's not a
simple matter to convert from one to the other mid-stream.

                           -Dave

Who's your advisor?

-scooter
  (aka "Dr. B. Scott Michel, UCLA CS 2004" :)

Ooooopppsssss... wasn't meant to go to the list.

-scooter

Chris Lattner wrote:

This is a very detailed design point. I don't think it makes sense
to talk about this until we are further along :).

It's actually a fundamental design decision and therefore needs to
be talked about up front. I understand where you're coming from
but static polymorphism is the design I'd choose. It's important to
get the sense of the community about this early on so I don't go do
something that people object to.

I agree that there are pros and cons, but I see these (static
specialization vs. dynamic dispatch) as two equivalent ways to solve
the same problem, just with two different sets of trade-offs. With
one you have code duplication (of an entire register allocator??);
with the other you have dynamic dispatch overhead.

This isn't static specialization. It's the Template Method pattern.
There's one generic algorithm that's parameterized through one or
more traits classes. The generic algorithm calls into the traits
classes to get varying behavior.

Template instantiation is static specialization. That, unfortunately, causes a lot of code duplication. It's not suitable for many classes. We are very concerned about overall footprint of LLVM.

Seriously, it's not important to get every design point decided before we proceed. We like incremental development. :) See http://llvm.org/docs/DeveloperPolicy.html#incremental

Evan