Supporting Complex Register Allocation Constraints (PBQP Allocator Status Update)

Hi All,

I’ve just committed some changes to the PBQP allocator which are designed to make it easier to implement custom register allocation constraints. This is a quick summary of those changes, and of the status of the PBQP allocator in general.

First a quick bit of background:

The PBQP allocator is based on ideas described in [1]. I implemented this algorithm (with the improved heuristic from [2]) in LLVM, and committed it to the mainline back in October 2008. Since then it has consistently passed correctness tests (compiling all benchmarks for which linear scan also succeeds, including the SPEC benchmarks), and produces similar code quality to linear scan on average (on my AMD64 test box at least). It currently has significant compile-time costs, requiring more memory and roughly triple the total compile time on average (compared to linear scan). Hopefully these will come down somewhat if/when I find more time to devote to optimizing the allocator, however I believe they will remain on the high end. The reason for this (and at the same time the source of PBQP’s strengths) is the generality of the mathematical optimization problem which underlies the allocator. This optimization problem allows a wide variety of register allocation costs and constraints to be expressed in a straightforward manner. If you have a target architecture with complex constraints and you want to get start getting good allocations with minimal engineering work, this may be the allocator for you.

Here’s how the PBQP allocator works:
(Note: This section dives in to a few details of how PBQP works, and how to use it, so you can skip it if you’re not interested)

Starting with some pseudo-C+±code for the allocator, and taking a very high-level view:

void pbqpAllocate() {
do {
problem = builder->buildPBQPProblem(machineFunction);
solution = solvePBQP(problem);
allocation = mapToAllocation(solution);
} while (containsSpills(allocation));
}

Essentially the allocator loops mapping the current machine function to an optimization problem, solving the problem, them mapping the solution back to an allocation. If there are any spills then they’re applied (using LLVM’s spilling infrastructure), and iteration terminates when there are no further spills required.

It’s a very similar process to graph coloring, only instead of building an interference graph and solving a k-colouring problem, you’re building a PBQP instance and minimizing an objective function. The good news is that there’s a PBQP solver implemented in LLVM, and the mapping from solutions to allocations is generic, so the only thing that needs to be done to support a new architecture is to supply a mapping from machine functions to PBQP instances for the new target. If you can do that, the existing infrastructure can handle the rest. (A caveat is appropriate here: PBQP is only solving the labeling problem. That’s powerful, but it doesn’t solve every problem in regalloc. If you have issues with the spiller/rewriter, or need high-quality bank-selection, or something like that, I’m afraid this won’t help).

So how do you write mappings?

First you need to know what a PBQP representation of a register allocation problem looks like. They’re represented as graphs, and those who know graph coloring will find the structure familiar. Each virtual register in the machine function is represented by a node in the PBQP graph, and constraints between virtual registers (such as interference) are represented as edges. in the PBQP graph each node is associated with a cost vector, which gives the cost of each allocation option for the virtual register. E.g.

spill 500 // ← Spill cost estimate
ax 0
cx 0
si 0

Typically you will use the spill cost estimate for the spill option, and zero for each register, however you can attach positive or negative costs to the individual registers to preference them as you like (which is handy if you’ve got preferred registers with shorter instruction encodings for instance).

Each edge in the PBQP graph is associated with a cost matrix. This gives the cost for combinations of allocations for the virtual registers it connects. So to represent an interference constraint you would have a matrix with infinite costs for aliasing physical registers, and zero everywhere else. Here a picture helps:

spill al ah cl dl
spill 0 0 0 0 0
ax 0 inf inf 0 0
cx 0 0 0 inf 0
si 0 0 0 0 0

If you want a coalescing edge you’d use a matrix with negative costs for elements that allocate virtual registers to the same physical register. To represent a pairing constraint you’d put an infinite cost on any pair whose allocation doesn’t obey the pairing restriction (See [1] for several example constraints).

The overall cost of a specific solution (i.e. a register assignment) is given by summing the costs over all nodes and edges in the graph. Finding an optimal solution is NP-complete (since you can map k-colouring problems to PBQP), however the solver should find you a decent solution relatively quickly.

There’s a lot more detail on all of this in [1] and [2], so I’ll leave you to read them for more information if you’re interested (or feel free to shoot me questions via email). For now I’ll get on to the the how-to bit for LLVM.

The new header include/llvm/CodeGen/RegAllocPBQP.h includes a class called PBQPBuilder. This class represents the mapping between machine functions and PBQP problems. By extending it and implementing the build method you can add your own register allocation constraints to the PBQP problem. Just call back to the PBQPBuilder base to have the spill costs and interference constraints added for you, then iterate over the instructions in the MachineFunction and add any additional constraints for your architecture. Finally you’ll need to add your own “createRegisterAllocator” function to the passmanager in your Target, and have this function construct a RegAllocPBQP (PBQP allocator) instance with your Builder. That should do it*.

Summing up:

The PBQP allocator remains solid, generates reasonable code, and has just been generalized to allow you to add custom register allocation constraints. People who want a (comparatively) easy way to get an allocator running for architectures with unusual constraints may find it useful.

Hopefully the above can serve as a starting point for anyone who’s interested to play around with the new features, and as always I’m very happy to hear questions and comments.

Regards,
Lang.

  • This claim is only partially tested, but I’m running the allocator right now with the base constraints plus an extension which adds coalescing costs and it’s just passed CINT2006 without error. Not bad given that the coalescing extension only took a couple of hours to write.

References:

[1] Scholz, B., Eckstein, E. 2002. Register allocation for irregular architectures. In Proceedings of the Joint Conference on Languages,
Compilers and Tools for Embedded Systems (LCTES’02), ACM Press, New York, NY, USA, 139-148.

[2] Hames, L. and Scholz, B. 2006. Nearly optimal register allocation with PBQP. In Proceedings of the 7th Joint Modular Languages Conference
(JMLC’06). LNCS, vol. 4228. Springer, New York, NY, USA. 346-361.

Thanks, Lang!

Out of curiosity, how are you dealing with live range splitting and coalescing in PBQP? I know you added the LoopSplitter pass.

For linear scan in LLVM we are going in the direction of aggressive coalescing before allocation, and on-demand splitting during allocation.

I know that other allocators use no early coalescing, and coalesce during allocation.

/jakob

Hi Jakob,

Out of curiosity, how are you dealing with live range splitting and coalescing in PBQP? I know you added the LoopSplitter pass.

For linear scan in LLVM we are going in the direction of aggressive coalescing before allocation, and on-demand splitting during allocation.

I know that other allocators use no early coalescing, and coalesce during allocation.

The PBQP allocator uses the aggressive coalescer and optionally adds coalescing costs for any remaining copies to the PBQP problem (if you use -pbqp-coalescing). Unfortunately relying on PBQP’s coalescing alone is not an option - the graphs coming out of PHI elimination are very large and without aggressive coalescing to bring the size down the heuristics don’t cope well.

Currently the PBQP allocator does not do any splitting, however I expect that it should be easy to add support for on-demand splitting, and I’ll be looking into it soon-ish.

I haven’t had time to work on the LoopSplitter recently. It looks like SplitKit has subsumed LoopSplitter’s functionality. If that’s the case then I can go ahead and kill off LoopSplitter.

Cheers,
Lang.

The PBQP allocator uses the aggressive coalescer and optionally adds coalescing costs for any remaining copies to the PBQP problem (if you use -pbqp-coalescing). Unfortunately relying on PBQP's coalescing alone is not an option - the graphs coming out of PHI elimination are very large and without aggressive coalescing to bring the size down the heuristics don't cope well.

Interesting. It sounds like PBQP is pretty much in the same boat as linear scan in that regard.

Another thing to consider is that just because phi elimination generates tons of copies, it doesn't mean those copies are in the right places. An example is a register that is used a lot before and after a loop, but not inside the loop. We may want to spill that register for the duration of the loop, but phi elimination hasn't created any helpful copies near the loop.

PBQP should be able to do well with -disable-physical-join, if it can model all the hinting properly. You might even be able to model multi-register hinting. Currently, CalcSpillWeights.cpp only uses the most desirable register for a hint. PBQP can properly take a second best hint, third best hint, etc as well.

Currently the PBQP allocator does not do any splitting, however I expect that it should be easy to add support for on-demand splitting, and I'll be looking into it soon-ish.

I haven't had time to work on the LoopSplitter recently. It looks like SplitKit has subsumed LoopSplitter's functionality. If that's the case then I can go ahead and kill off LoopSplitter.

SplitKit is currently pretty broken, but I am working on fixing it. It turns out you need a good deal of logic to correctly update LiveIntervals after a split. The intention is that you can cut out any connected region of the CFG and use a new register there. Examples of split regions are:

- A loop.
- A well connected cluster of basic blocks.
- A single basic block.
- A sequence of instructions inside a basic block.

Hi Jakob,

PBQP should be able to do well with -disable-physical-join, if it can model all the hinting properly.

That’s a good point. I’ll run some tests and see what it does to allocation quality.

SplitKit is currently pretty broken, but I am working on fixing it. It turns out you need a good deal of logic to correctly update LiveIntervals after a split. The intention is that you can cut out any connected region of the CFG and use a new register there. Examples of split regions are:

  • A loop.
  • A well connected cluster of basic blocks.
  • A single basic block.
  • A sequence of instructions inside a basic block.

That sounds fantastic. I’ll start playing around with it.

LoopSplitter is currently broken too. I’d agree with your observation on updating LiveIntervals - I had similar trouble.

Cheers,
Lang.

Regards,
Lang.

  • This claim is only partially tested, but I’m running the allocator right now with the base constraints plus an extension which adds coalescing costs and it’s just passed CINT2006 without error. Not bad given that the coalescing extension only took a couple of hours to write

Following up on this footnote - The PBQPBuilderWithCoalescing builder is now in the mainline. If anyone’s looking for an example of how to add custom constraints this is a good one.

Cheers,
Lang.

Hi All,

A quick note for any PBQP users out there: As of r114626 the new PBQPBuilder based problem construction process is the default. I’ll leave the old (internal-to-the-allocator) construction process in place for a few days while everything’s being tested, but assuming there are no serious regressions it will be removed soon. You don’t need to do anything, but during this transition period if your application starts breaking you can pass -pbqp-builder=false to llc/lli to get the old construction process and see if that fixes things. If it does it would indicate a problem with the new builder code and you should file a bug asap.

Anyone who’s actually tempted to try adding custom constraints: The allocator has been moved back out of the RegAllocPBQP.h header to hide the gory details. To construct an instance you can just call createPBQPRegisterAllocator and pass your builder in.

Cheers,
Lang.