Regalloc Refactoring

Hi all,

As I work toward improving LLVM register allocation, I've
come across the need to do some refactoring. Specifically,
I would like to separate register coalescing from live
interval analysis. Right now LiveIntervals handles both.
The reason I want to separate them is that other types of
register allocators might like to do coalescing differently
(e.g. graph coloring does it by examining the interference
graph). It would also allow different heuristics and
coalescing strategies.

So far, this has been pretty easy. I've created a new
SimpleRegisterCoalescing pass and the member functions
divide pretty cleanly between than and LiveIntervals.

For the same reasons, I would like to eventually separate
the spill cost calculation from the coalescing phase and
put the state somewhere besides the LiveInterval class but
that's a project for later. Doing this would increase
compile time slightly as it would require an extra pass
over the program. Is this ok?

The problem is LiveIntervals::CreateNewLiveInterval. This
member references LiveIntervals::rep, which as far as I can
tell makes use of temporary state (r2rMap_) generated
during the coalescing pass. My reading indicates that
at final loop nest of LiveIntervals::runOnMachineFunction
replaces operand registers with using rep() which makes
use of r2rMap_.

So why does LiveIntervals::CreateNewLiveInterval use r2rMap_?
Actually, in the official sources, this member is not used by
anyone according to the Doxygen pages. The only use I see is
in RegAllocGraphColoring.cpp which was posted to the mailing
list some time ago.

So I have several questions:

- Does this refactoring sound like a good idea? Would a patch
   be accepted?

- Is my plan to separate spill cost calculation from coalescing sound?

- Where do I send patches for approval?

- What is LiveIntervals::CreateNewLiveInterval used for?

- Why does LiveIntervals::CreateNewLiveInterval reference r2rMap_?

                                 -Dave

Hi David,

As I work toward improving LLVM register allocation, I've
come across the need to do some refactoring. Specifically,
I would like to separate register coalescing from live
interval analysis. Right now LiveIntervals handles both.
The reason I want to separate them is that other types of
register allocators might like to do coalescing differently
(e.g. graph coloring does it by examining the interference
graph). It would also allow different heuristics and
coalescing strategies.

My experience with implementing Wimmer's version of the linear scan
also shows that register coalescing should be a separate pass, possibly
even selectable independently from the register allocation algorithm.
And for graph coloring register allocators coalescing is usually done
very differently, as you point out.

So far, this has been pretty easy. I've created a new
SimpleRegisterCoalescing pass and the member functions
divide pretty cleanly between than and LiveIntervals.

For the same reasons, I would like to eventually separate
the spill cost calculation from the coalescing phase and
put the state somewhere besides the LiveInterval class but
that's a project for later. Doing this would increase
compile time slightly as it would require an extra pass
over the program. Is this ok?

I also agree that it is cleaner to have it in a separate source file in
a more modular fashion. If you don't want to increase the compile time,
may be some sort of plugin or policy driven approach approach can be
used? In this case, the pass over the program is still done by the
LiveIntervalAnalysis, but the logic for cost computation could be
implemented somewhere else and would be just called from the
LiveIntervalAnalysis. The kind of spill cost computation could be even
defined by a command-line option.

The problem is LiveIntervals::CreateNewLiveInterval. This
member references LiveIntervals::rep, which as far as I can
tell makes use of temporary state (r2rMap_) generated
during the coalescing pass. My reading indicates that
at final loop nest of LiveIntervals::runOnMachineFunction
replaces operand registers with using rep() which makes
use of r2rMap_.

So why does LiveIntervals::CreateNewLiveInterval use r2rMap_?
Actually, in the official sources, this member is not used by
anyone according to the Doxygen pages. The only use I see is
in RegAllocGraphColoring.cpp which was posted to the mailing
list some time ago.

r2rMap_ is used for selecting a representative register for a group of
coalesced registers. This is required during coalescing and for
rewriting of the code, when replacing the coalesced registers by
representative ones. This approach is used by many linear scan
implementations. But it is not required for many other kinds of
register allocations. For example, in Wimmer's version of linear scan
coalescing is done during the allocation and therefore the
implementation maintains its own mapping similar to r2rMap_.

In my opinion, current implementation of LiveIntervalAnalysis as well
some other related things is very tied to the specific flavor of the
linear scan register allocation algorithm and is not generic enough to
support many other kinds of register allocation. So, any work making it
more generic would open possibilities for easier addition of new
register allocators to LLVM.

So I have several questions:

- Does this refactoring sound like a good idea?

Yes.

- Is my plan to separate spill cost calculation from coalescing
sound?

Most likely - yes.

- What is LiveIntervals::CreateNewLiveInterval used for?

I guess for splitting live intervals and similar things.

-Roman

As I work toward improving LLVM register allocation, I've
come across the need to do some refactoring.

cool. :slight_smile: One request: Evan is currently out on vacation until Monday. This is an area that he is very interested in and will want to chime in on. Please don't start anything drastic until he gets back :).

Specifically, I would like to separate register coalescing from live interval analysis. Right now LiveIntervals handles both. The reason I want to separate them is that other types of register allocators might like to do coalescing differently (e.g. graph coloring does it by examining the interference graph). It would also allow different heuristics and coalescing strategies.

Ok. Another thing to ponder on: Our current phi elimination pass is the source of many of our coallescing-related issues. Currently, each phi is lowered into one copy for the result and one copy for each input. This is a *lot of copies*! This is a problem both for coallescing (because it has to do so much, it is required to be very aggressive) and compile time (phi elim + coallescing is much slower than inserting the right copies to begin with).

If you're interested in this, I'd suggest taking a look at some of the smart phi elimination algorithms which use dominance properties (i.e., they don't require an interference graph).

So far, this has been pretty easy. I've created a new
SimpleRegisterCoalescing pass and the member functions
divide pretty cleanly between than and LiveIntervals.

Beyond that, one of the issues is the "r2rmap" and "rep" function. As you've noticed, the coallescer basically uses these to avoid rewriting the code after it does coallescing. For example, if r1024 is coallesced with r1026, it leaves all uses of both registers in the code, instead of rewriting uses of r1026 with r1024 and deleting all memory of r1026. This mades sense long ago, but now it is increasingly clear that this is a bad idea. I would much rather have the coallescer be responsible for updating the code. I would suggest doing this as a first step, it is clearly goodness :slight_smile:

For the same reasons, I would like to eventually separate
the spill cost calculation from the coalescing phase and
put the state somewhere besides the LiveInterval class but
that's a project for later. Doing this would increase
compile time slightly as it would require an extra pass
over the program. Is this ok?

This seems fine in principle, but i'd like Evan to chime in when he's back.

So I have several questions:

- Does this refactoring sound like a good idea? Would a patch
  be accepted?

Yes, but try to break this down into small pieces: first step is to eliminate rep/r2rmap. Second step is to split coallescer off. PHI elim changes are independent of this, but would also be a separate project.

- Is my plan to separate spill cost calculation from coalescing sound?

Seems fine to me, but I don't know enough :slight_smile:

- Where do I send patches for approval?

llvm-commits mailing list please.

-Chris

Chris Lattner wrote:

As I work toward improving LLVM register allocation, I've
come across the need to do some refactoring.

cool. :slight_smile: One request: Evan is currently out on vacation until Monday. This is an area that he is very interested in and will want to chime in on. Please don't start anything drastic until he gets back :).

Will do. Right now I'm mostly just experimenting. We keep our
own copy of the repository here so it's easy to back out changes.

If you're interested in this, I'd suggest taking a look at some of the smart phi elimination algorithms which use dominance properties (i.e., they don't require an interference graph).

I'm definitely interested in improving coalescing and it sounds like
this would fall under that work. Do you have references to papers
that talk about the various algorithms?

Beyond that, one of the issues is the "r2rmap" and "rep" function. As you've noticed, the coallescer basically uses these to avoid rewriting the code after it does coallescing. For example, if r1024 is coallesced with r1026, it leaves all uses of both registers in the code, instead of rewriting uses of r1026 with r1024 and deleting all memory of r1026. This mades sense long ago, but now it is increasingly clear that this is a bad idea. I would much rather have the coallescer be responsible for updating the code. I would suggest doing this as a first step, it is clearly goodness :slight_smile:

This is what I was trying to get at, but wasn't entirely clear about.
Does the loop nest after the call to joinIntervals in LiveIntervals::runOnMachineFunction do this rewrite? Specifically:

   // perform a final pass over the instructions and compute spill
   // weights, coalesce virtual registers and remove identity moves.
   const LoopInfo &loopInfo = getAnalysis<LoopInfo>();

   for (MachineFunction::iterator mbbi = mf_->begin(), mbbe = mf_->end();
        mbbi != mbbe; ++mbbi) {
     [...]

       for (unsigned i = 0, e = mii->getNumOperands(); i != e; ++i) {
         const MachineOperand &mop = mii->getOperand(i);
         if (mop.isRegister() && mop.getReg() &&
             MRegisterInfo::isVirtualRegister(mop.getReg())) {
           // replace register with representative register
           unsigned reg = rep(mop.getReg());
           mii->getOperand(i).setReg(reg);

Doesn't that last statement actually do the rewrite? And how does
this interact with the spill cost computation? I'm thinking separating
out the spill cost computation is going to be a bit more work because
right now it does some of its calculations in concert with this
rewriting. But it seems workable.

For the same reasons, I would like to eventually separate
the spill cost calculation from the coalescing phase and

This seems fine in principle, but i'd like Evan to chime in when he's back.

Sure.

Yes, but try to break this down into small pieces: first step is to eliminate rep/r2rmap. Second step is to split coallescer off. PHI elim changes are independent of this, but would also be a separate project.

Good work plan. That's how I'll submit the patches.

                                -Dave

I'm definitely interested in improving coalescing and it sounds like
this would fall under that work. Do you have references to papers
that talk about the various algorithms?

Some suggestions:

@InProceedings{Budimlic02,
   AUTHOR = {Zoran Budimlic and Keith D. Cooper and Timothy J. Harvey and Ken Kennedy and Timothy S. Oberg and Steven W. Reeves},
   YEAR = "2002",
   TITLE = "Fast copy coalescing and live-range identification",
   BOOKTITLE = "PLDI",
   PAGES = "25-32",
   PUBLISHER = "ACM Press"
}

@InProceedings{Sreedhar99,
   AUTHOR = "Vugranam C. Sreedhar and Roy Dz-ching Ju and David M. Gillies and Vatsa Santhanam",
   YEAR = 1999,
   TITLE = "Translating out of Static Single Assignment Form",
   booktitle = {SAS},
   publisher = {Springer-Verlag},
   pages = {194-210}
}

And I have a quite fast algo that I believe is simpler than [Budimlic02] and I can share it with you :slight_smile:

Fernando

I’m definitely interested in improving coalescing and it sounds like
this would fall under that work. Do you have references to papers
that talk about the various algorithms?

Probably, you’ve seen this paper: http://gcc-ca.internet.bs/summit/2003/Graph%20Coloring%20Register%20Allocation.pdf
The author depicts everything you could do with register allocation :slight_smile: He touches different ways of coalescing, spilling, simplifying, splitting etc. Good place to get started.

And I have a quite fast algo that I believe is simpler than [Budimlic02]
and I can share it with you :slight_smile:

Do you have a paper on this? I'd be interested in seeing it.

-Tanya

And I have a quite fast algo that I believe is simpler than [Budimlic02]
and I can share it with you :slight_smile:

Do you have a paper on this? I'd be interested in seeing it.

Yes, I have a tech report on this page:

http://compilers/fernando/projects/soc/

and I have submitted a paper to SAS, and now I am waiting for the review. The coalescing algorithm is described in sec 4.3. It takes about 10% of the time used in Live Variables analysis in the built in LLVM compiler:

    0.0846 ( 1.2%) 0.0009 ( 2.1%) 0.0855 ( 1.2%) 0.0855 ( 1.2%) Live Variable Analysis
    0.0737 ( 1.0%) 0.0009 ( 2.1%) 0.0746 ( 1.1%) 0.0746 ( 1.0%) Interval Analysis - Fernando.
    0.0361 ( 0.5%) 0.0007 ( 1.6%) 0.0368 ( 0.5%) 0.0368 ( 0.5%) Loop Strength Reduction
    0.0146 ( 0.2%) 0.0003 ( 0.6%) 0.0149 ( 0.2%) 0.0149 ( 0.2%) Canonicalize natural loops
    0.0134 ( 0.2%) 0.0000 ( 0.1%) 0.0135 ( 0.2%) 0.0135 ( 0.1%) Natural Loop Construction
    0.0134 ( 0.2%) 0.0000 ( 0.1%) 0.0135 ( 0.1%) 0.0135 ( 0.1%) Phi mem coalescer - Fernando. <-- **** is this pass here.

Fernando

Hi,

Fernando wrote:

Yes, I have a tech report on this page:

http://compilers/fernando/projects/soc/

That's http://compilers.cs.ucla.edu/fernando/projects/soc/ in case
anyone had trouble finding it.

Cheers,

Ralph.

Fernando Magno Quintao Pereira wrote:

Yes, I have a tech report on this page:

http://compilers/fernando/projects/soc/

Cool. I'm also interested in the Chordal Graph allocator. Are
you planning to integrate it into the main llvm repository? It
would make an interesting research project to compare allocation
algorithms on real-world machines.

                                 -Dave

Yes, I have a tech report on this page:

http://compilers/fernando/projects/soc/

Cool. I'm also interested in the Chordal Graph allocator. Are
you planning to integrate it into the main llvm repository? It
would make an interesting research project to compare allocation
algorithms on real-world machines.

Oops, the link was not correct. Should be:
http://compilers.cs.ucla.edu/fernando/projects/soc/

About the integration: I have a SSA-based allocator up and running. It is slow, because it has not been implemented directly in LLVM. I am currently trying to adapt RegAllocLinearScan.cpp to run it. But this will not be done soon, for I am having to divide my time with other project. I don't plan to build the graph-based chordal allocator, but I can help someone who is willing to do it. The algorithm does not have patent :slight_smile: The problem is that to build the graph is slow, and it does not bring advantages in terms of better results than using the linear-scan based version. A linear scan traversal in a SSA-form control flow graph produces optimal results, if you don't have fixed-registers on the way. It can be done as fast as in a traditional linear scan, and it spills much less, because of the smaller live ranges.

Fernando

I'm definitely interested in improving coalescing and it sounds like
this would fall under that work. Do you have references to papers
that talk about the various algorithms?

Some suggestions:

@InProceedings{Budimlic02,
  AUTHOR = {Zoran Budimlic and Keith D. Cooper and Timothy J. Harvey
and Ken Kennedy and Timothy S. Oberg and Steven W. Reeves},
  YEAR = "2002",
  TITLE = "Fast copy coalescing and live-range identification",

Yep, this is the one I was thinking of. It is available online here:
http://www.cs.rice.edu/~keith/LACSI/pldi02.pdf

And I have a quite fast algo that I believe is simpler than [Budimlic02]
and I can share it with you :slight_smile:

Can you briefly explain what it simplifies?

-Chris

Beyond that, one of the issues is the "r2rmap" and "rep" function. As
you've noticed, the coallescer basically uses these to avoid rewriting the
code after it does coallescing. For example, if r1024 is coallesced with
r1026, it leaves all uses of both registers in the code, instead of
rewriting uses of r1026 with r1024 and deleting all memory of r1026. This
mades sense long ago, but now it is increasingly clear that this is a bad
idea. I would much rather have the coallescer be responsible for updating
the code. I would suggest doing this as a first step, it is clearly
goodness :slight_smile:

This is what I was trying to get at, but wasn't entirely clear about.
Does the loop nest after the call to joinIntervals in
LiveIntervals::runOnMachineFunction do this rewrite?

I didn't think anything did, but...

Specifically:

  // perform a final pass over the instructions and compute spill
  // weights, coalesce virtual registers and remove identity moves.
  const LoopInfo &loopInfo = getAnalysis<LoopInfo>();

  for (MachineFunction::iterator mbbi = mf_->begin(), mbbe = mf_->end();
       mbbi != mbbe; ++mbbi) {
    [...]

      for (unsigned i = 0, e = mii->getNumOperands(); i != e; ++i) {
        const MachineOperand &mop = mii->getOperand(i);
        if (mop.isRegister() && mop.getReg() &&
            MRegisterInfo::isVirtualRegister(mop.getReg())) {
          // replace register with representative register
          unsigned reg = rep(mop.getReg());
          mii->getOperand(i).setReg(reg);

Doesn't that last statement actually do the rewrite?

Hrm, yes, yes it appears so. Question is: doesn't this make the r2r map dead? Does something else fill it in? My memory is hazy here :). If it is dead, we should rip it out (actually, we should make it private to the coallescer function).

And how does this interact with the spill cost computation? I'm thinking separating out the spill cost computation is going to be a bit more work because right now it does some of its calculations in concert with this rewriting. But it seems workable.

Makes sense to me.

Yes, but try to break this down into small pieces: first step is to
eliminate rep/r2rmap. Second step is to split coallescer off. PHI elim
changes are independent of this, but would also be a separate project.

Good work plan. That's how I'll submit the patches.

Great!

-Chris

Can you briefly explain what it simplifies?

Let me try to explain my algorithm first, and then you can try to compare with the others already described.

The objective is to convert the SSA-form program into a conventional SSA-form program (CSSA). In this flavor of SSA, the variables related by phi-functions do not interfere, and can be assigned the same register, as in Budimlic's case, or the same memory address if spilled (that is what I am doing). One way to convert a program in CSSA-form is to split the live range of each virtual that is part of a phi function. You can do this using the algorithm below:

PhiLifting
   For each instruction
   (a_i := phi(a_{i1}:B_1, a_{i2}:B_2, ..., a_{im}:B_m))
     Create new virtual v_i
       Add copy instruction I = ( a_i := v_i ) at the end of block B_j
       Replace a_i by v_i in phi
       For each virtual a_{ij} in phi
         Create new virtual v_{ij}
         Add copy instruction I = (v_{ij} := a_{ij})
         at the end of block B_j
         Replace a_{ij} by v_{ij} in \phi

This algorithm will add one copy for each virtual that is a parameter of a phi-function, and one copy for each virtual that is the result of a phi-function. After you run it, given a phi function (a := phi(b, c)), you can assign the same register to a, b and c, because their live ranges don't overlap. But this is too convervative, because now the live ranges are very small. In order to increase this, you can merge classes of phi-related virtuals using the algorithm below:

PhiMemCoalesce
(S_I = { I_1, I_2, ..., I_q } // instructions created by PhiLifting.
S_Q = { Q_1, Q_2, ..., Q_r }) /// Classes of phi-related virtuals.
For each instruction I = ( v_{ij} := a_{ij} ) in S_I
     S_I := S_I - { I }
         Let Q_v be the equivalence class of v_{ij}
             Let Q_a be the equivalence class of a_{ij}
                 if Q_v intersection Q_v = {} // they don't overlap
                     S_Q := S_Q - { Q_v }
                     S_Q := S_Q - { Q_a }
                     S_Q := S_Q + { Q_a union Q_v }

Virtuals that are in the same phi function are said to be in the same equivalence class of phi-related virtuals. For instance, for a := phi(b, c), the equivalence class is Q = {a, b, c}.

To make it efficient, I do not build the interference graph of the source program in order to perform operations such as 'Q_i intersection Q_j'. Instead, I rely on the interval representation of live ranges commonly used in versions of the linear scan algorithm. Each virtual is represented as a collection of ordered intervals on the linearized control flow graph. Thus, a set Q of virtuals is a set of ordered integer intervals. In this way, the intersection of two phi-equivalence classes Q_i and Q_j can be found in time linear on the number of disjoint intervals in both sets. Because a phi-equivalence class can have O(V) disjoint intervals, the final complexity of algorithm PhiMemCoalesce is O(|L_i| * V). In my experiments, the interval based implementation of PhiMemCoalesce accounts for less than 1% of the total compilation time, whereas the interference graph based algorithm takes more than 30% of the compilation time!

Fernando

Chris Lattner wrote:

Doesn't that last statement actually do the rewrite?

Hrm, yes, yes it appears so. Question is: doesn't this make the r2r map dead? Does something else fill it in? My memory is hazy here :). If it is dead, we should rip it out (actually, we should make it private to the coallescer function).

I'm trying an experiment to eliminate the r2r map altogether. Is there
an efficient way to replace all references to a virtual register? That
is, given a virtual register number, is there an easy way to get a list
of all Instructions and/or Operands that reference it?

I've been looking at the Use and Value classes and am wondering if there
is something there I could use.

                                -Dave

No there isn't, unfortunately. I'd suggest building up/maintaining the r2r map inside the coallescer. Once the coallescer is done with the entire function, do a single pass over the function rewriting all the coallesced vregs.

-Chris

Chris Lattner wrote:

No there isn't, unfortunately. I'd suggest building up/maintaining the r2r map inside the coallescer. Once the coallescer is done with the entire function, do a single pass over the function rewriting all the coallesced vregs.

Ok. I have a version with the coalescer separated from
liveIntervalAnalysis. It still uses the r2r map but as we
discussed late last week, it looks like the rewrite is
already done. I will make the r2r map and all APIs into
it private within the coalescer class and submit a patch.

Good?

                              -Dave

Hey, David.

     just a curiosity: does your changes have any impact on the compilation time? E.g, did you try comparing the old version with the modified version using llc -time-passes?

best,

Fernando

Sounds excellent,

-Chris

Fernando Magno Quintao Pereira wrote:

Hey, David.

     just a curiosity: does your changes have any impact on the compilation time? E.g, did you try comparing the old version with the modified version using llc -time-passes?

I haven't tested it yet (waiting on some updates from colleagues here)
but there should be no time increase. Quite literally, it just involves
moving member functions from one class to another, creating a new pass
and adding in the proper AnalysisUsage information.

Separating out the spill cost computation will increase compile time
but I don't imagine it will be significant.

                                   -Dave