BigBlock register allocator

Hi everyone,

Quick summary:

  LLVM now has a new register allocator particularly suitable for compiling (very) large, machine-generated functions.

Longer story:

  I've recently been using LLVM in an application that involves JITing fairly large functions that have no control flow - they're just flat sequences of instructions, anywhere from 100 to 10000+ in size. (The control flow is all in the host program, which works out which monster function to call and when.)

  The default (linearscan) register allocator wasn't doing a good job, as it doesn't (yet) have live range splitting. It would quickly use all available registers and then get stuck, using only a single register (and the stack, a lot!) for hundreds or thousands of instructions at a time, greatly slowing (+bloating) the code. Luckily, I didn't have the presence of mind to actually try using the local register allocator before I ripped its guts out, turning it into RegAllocBigBlock.cpp that's now in CVS. (Had I done that, I would have been quite happy to find that the local allocator produced much better code, much faster than linearscan.)
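(An aside for readers unfamiliar with the failure mode: it can be seen in a minimal toy model of linear scan without live-range splitting - this is not LLVM's code, and all names here are made up. Each interval gets one physical register for its entire length or lives on the stack, so one long interval pins a register and forces overlapping short intervals to spill.)

```cpp
// Toy model of linear-scan allocation WITHOUT live-range splitting.
// Illustrative only; not LLVM's implementation.
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval { int start, end, vreg; };

// Returns how many intervals had to be spilled given numRegs
// physical registers.
int allocate(std::vector<Interval> ivs, int numRegs) {
  std::sort(ivs.begin(), ivs.end(),
            [](const Interval &a, const Interval &b) {
              return a.start < b.start;
            });
  std::vector<Interval> active; // intervals currently holding a register
  int spills = 0;
  for (const Interval &iv : ivs) {
    // Expire intervals that ended before this one starts.
    active.erase(std::remove_if(active.begin(), active.end(),
                                [&](const Interval &a) {
                                  return a.end < iv.start;
                                }),
                 active.end());
    if ((int)active.size() < numRegs)
      active.push_back(iv); // register held for the WHOLE interval
    else
      ++spills; // no splitting: the entire interval goes to the stack
  }
  return spills;
}
```

With one register, a long interval [0,1000] holds it throughout, so every overlapping short interval spills; with two registers, the short non-overlapping intervals can share the second one.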

  The good news is the new "BigBlock" allocator turns out to produce even better code than the local allocator when blocks are very large. We're talking a +10~20% speed boost on average. (If your basic blocks are small, or there's not much register pressure, you'll actually get the same code out of both local and BigBlock.)

  While BigBlock isn't (and never will be) as fast as the local allocator, it's not much slower, doesn't use much memory, and is certainly faster than linearscan. So if you're compiling very large, (probably) machine-generated blocks of straight-line code, give the Local and BigBlock allocators a try, especially if you're JITing things and compile time is important.

  If anyone has any problems with or comments on the BigBlock allocator, please drop me an email!

  Enjoy,

  Duraid

Hi Duraid,

Hi everyone,

Quick summary:

  LLVM now has a new register allocator particularly suitable for
compiling (very) large, machine-generated functions.

Congrats! Very good job!

Longer story:

  I've recently been using LLVM in an application that involves JITing
fairly large functions that have no control flow - they're just flat
sequences of instructions, anywhere from 100 to 10000+ in size. (The
control flow is all in the host program, which works out which
monster function to call and when.)

  The default (linearscan) register allocator wasn't doing a good job,
as it doesn't (yet) have live range splitting. It would quickly use all
available registers and then get stuck, using only a single register
(and the stack, a lot!) for hundreds or thousands of instructions at
a time, greatly slowing (+bloating) the code.

True. I'm working on a version of linear scan based on Wimmer's
thesis. It supports live range splitting. I'd like to compare it with
yours. Do you have any good examples of those fairly large functions
that are just flat sequences of instructions, anywhere from 100 to
10000+ in size? It would be nice if you could send me those test
cases (as C or .ll files). I could then use them as a basis for a
comparison and report the results.

  The good news is the new "BigBlock" allocator turns out to
produce even better code than the local allocator when blocks are very
large. We're talking a +10~20% speed boost on average. (If your basic
blocks are small, or there's not much register pressure, you'll
actually get the same code out of both local and BigBlock.)

Do you have numbers comparing it to the current version of LLVM's
linear scan? The win of your allocator over linear scan should be
even better, I guess.

  While BigBlock isn't (and never will be) as fast as the local
allocator, it's not much slower, doesn't use much memory, and is
certainly faster than linearscan. So if you're compiling very large,
(probably) machine-generated blocks of straight-line code, give the
Local and BigBlock allocators a try, especially if you're JITing
things and compile time is important.

I looked at your code, and I see some things that could be
significantly sped up, e.g.:
- InsnTimes handling. I have the feeling this map can be eliminated
completely.
- Use of the VRegReadTable. The vector of read occurrences can be
shortened each time you process the corresponding instruction. This
keeps it shorter and makes searches inside it faster, thus making
chooseReg much faster. Some other optimizations can probably also be
applied to the chooseReg function.
- PhysRegsUseOrder - you remove some elements from the middle of this
vector in removePhysReg. This is not a very efficient operation on
vectors, since it needs to copy the tail of the vector. I think a
list data structure could be much more efficient for this purpose.
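(To make the PhysRegsUseOrder point concrete, here is a standalone sketch with made-up names, not the allocator's actual code: erasing from the middle of a std::vector copies the whole tail down one slot, while a std::list erase is a constant-time unlink once the node is found, and a swap-with-back-and-pop avoids the copy entirely when element order doesn't matter.)

```cpp
// Standalone sketch of the data-structure trade-off for removing an
// element from the middle of a container. Illustrative names only.
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// std::vector erase: O(n) copy of everything after the erase point,
// but relative order of the survivors is preserved.
void removeFromVector(std::vector<unsigned> &v, unsigned reg) {
  v.erase(std::find(v.begin(), v.end(), reg));
}

// std::list erase: O(1) unlink once found; order also preserved.
void removeFromList(std::list<unsigned> &l, unsigned reg) {
  l.erase(std::find(l.begin(), l.end(), reg));
}

// Swap-and-pop: O(1) removal, but the last element moves into the
// vacated slot, so the container's order is scrambled.
void swapAndPop(std::vector<unsigned> &v, unsigned reg) {
  auto it = std::find(v.begin(), v.end(), reg);
  *it = v.back();
  v.pop_back();
}
```

Swap-and-pop reorders the remaining elements, so it only applies where the use order itself isn't actually needed; otherwise the list is the safe O(1) choice.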

I think these changes may significantly improve the performance of
your BigBlock register allocator. I'll try to come up with some more
concrete proposals or even patches over the weekend or next week.

-Roman

Hi Roman,

True. I'm working on a version of linear scan based on Wimmer's
thesis. It supports live range splitting. I'd like to compare it with
yours. Do you have any good examples of those fairly large functions
that are just flat sequences of instructions, anywhere from 100 to
10000+ in size?

You can find a couple attached to bug #1512, but I'm happy to email you
more. Just in case people out there think JITing 10000-instruction functions
is crazy (I did), you should know that the Core2 processor appears to have
no real problem running *very* large functions out of L2$.

I could then use them as a basis for a comparison and report the
results.

No problem - I'd be curious to see if your allocator can improve the code
enough to justify the extra time spent.

Do you have numbers comparing it to the current version of LLVM's
linear scan? The win of your allocator over linear scan should be
even better, I guess.

Both local and bigblock are way, way ahead of the linearscan allocator for my
application (and I imagine others, such as large unrolled linear algebra
kernels, FFTs/codecs etc). On the code I have, performance improves
by around 50~100% using the local allocator. Using bigblock gives another
10~20% on top of that, with the gap more or less proportional to the size
of the block. For blocks with <200 instructions, local and bigblock tend
to produce more or less the same code.

I looked at your code, and I see some things that could be
significantly sped up, e.g.:
- InsnTimes handling. I have the feeling this map can be eliminated
completely.

Absolutely, I agree it should go. I was in a hurry, and my first attempt
at eliminating it didn't work, so I just put it back. I'll get around to
killing it again at some point.

- Use of the VRegReadTable. The vector of read occurrences can be
shortened each time you process the corresponding instruction. This
keeps it shorter and makes searches inside it faster, thus making
chooseReg much faster. Some other optimizations can probably also be
applied to the chooseReg function.

chooseReg is of course where most of the time is spent, but truth be told,
98% of the time my application spends in LLVM is *not* in the register
allocator. Or I should say *was* not - a quick patch by Evan Cheng earlier
today to some scheduler code more than doubled LLVM's speed on large
functions. However, there are still some very serious LLVM performance
issues when codegenning large basic blocks. A couple of years ago,
I clocked the LLVM JIT at roughly 10,000 cycles per byte of code
generated. Currently, it's *well* over 200,000. (Again, this is on large
functions. On small functions, the performance is presumably much better.)
Of course, back then, LLVM didn't *have* a scheduler, and even things like
instruction selection were simpler. But somewhere, some nasty speed issues
have crept in when compiling huge functions. I'll do my best to try and sift
through these over the next few weeks. (Or at least provide Evan with plenty
of test cases and beg him to fix things.) ;)

- PhysRegsUseOrder - you remove some elements from the middle of this
vector in removePhysReg. This is not a very efficient operation on
vectors, since it needs to copy the tail of the vector. I think a
list data structure could be much more efficient for this purpose.

Actually, PhysRegsUseOrder isn't really needed for BigBlock - it's
a leftover from the local allocator. So there are even more brutal
performance fixes to be made to BigBlock, however your suggestion
applies to the Local allocator for sure!

I think these changes may significantly improve the performance of
your BigBlock register allocator. I'll try to come up with some more
concrete proposals or even patches over the weekend or next week.

Patches would be a dream - while I don't expect BigBlock to ever be
more than ~25% of total codegen time, every little bit helps. :)

Thanks for your comments,

    Duraid

Thanks Duraid. Nice job.

I'd like to point out a couple of things.

1. The local register allocator is currently broken. :( Actually, it has been broken for a while, due to physical sub-register changes in earlier passes. I am currently working on it; you will probably need to merge the fixes back once I am done.
2. Please benchmark against some more "normally sized" programs. Perhaps SPEC? I don't expect a local register allocator to win against linearscan.

Thanks,

Evan

chooseReg is of course where most of the time is spent, but truth be told,
98% of the time my application spends in LLVM is *not* in the register
allocator. Or I should say *was* not - a quick patch by Evan Cheng earlier
today to some scheduler code more than doubled LLVM's speed on large
functions. However, there are still some very serious LLVM performance
issues when codegenning large basic blocks. A couple of years ago,
I clocked the LLVM JIT at roughly 10,000 cycles per byte of code
generated. Currently, it's *well* over 200,000. (Again, this is on large
functions. On small functions, the performance is presumably much better.)
Of course, back then, LLVM didn't *have* a scheduler, and even things like
instruction selection were simpler. But somewhere, some nasty speed issues
have crept in when compiling huge functions. I'll do my best to try and sift
through these over the next few weeks. (Or at least provide Evan with plenty
of test cases and beg him to fix things.) ;)

Instruction selection is now much more complex. We have also added more passes, such as instruction scheduling and branch folding, and LSR is now enabled for X86, etc.

One thing that you should do for the JIT is to disable code size optimizations such as branch folding. We usually don't like codegen for static compilation and the JIT to differ. However, I think it's time to refine that policy. Please add an llc option that emulates JIT codegen (except for the code emission part, of course). Perhaps something like llc -emulate-jit? That will make it easier to reproduce JIT codegen bugs.

Evan