True. I'm working on a version of linear scan based on Wimmer's
thesis; it supports live range splitting, and I'd like to compare it
with yours. Do you have any good examples of those fairly large
functions that are just flat sequences of instructions, anywhere from
100 to 10000+ instructions in size?
You can find a couple attached to bug #1512, but I'm happy to email you
more. Just in case people out there think JITing 10000-instruction functions
is crazy (I did), you should know that the Core2 processor appears to have
no real problem running *very* large functions out of L2$.
I could then use it as a basis for a comparison and report the results.
No problem - I'd be curious to see if your allocator can improve the code
enough to justify the extra time spent.
Do you have numbers comparing it to the current version of LLVM's
linear scan? The win of your allocator over linear scan should be
even bigger, I guess.
Both local and bigblock are way, way ahead of the linearscan allocator for my
application (and I imagine others, such as large unrolled linear algebra
kernels, FFTs/codecs, etc.). On the code I have, performance improves
by around 50~100% using the local allocator. Using bigblock gives another
10~20% on top of that, with the gap more or less proportional to the size
of the block. For blocks with <200 instructions, local and bigblock tend
to produce more or less the same code.
I looked at your code, and I see some things that could be
significantly sped up, e.g.:
- InsnTimes handling. I have the feeling this map can be eliminated.
Absolutely, I agree it should go. I was in a hurry, and my first attempt
at eliminating it didn't work, so I just put it back. I'll get around to
killing it again at some point.
- use of the VRegReadTable. The vector of read occurrences can be
shortened each time you process the corresponding instruction. This
keeps the vector shorter and makes searches inside it faster, thus
making chooseReg much faster. Probably some other optimizations
can be applied to the chooseReg function as well.
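For illustration, the shortening suggestion might look something like this
sketch (the names and layout here are hypothetical, not the actual BigBlock
data structures): keep each register's recorded reads sorted in descending
instruction order, so that entries for already-processed instructions can be
dropped cheaply from the back.

```cpp
#include <map>
#include <vector>

// Hypothetical sketch of a read table: maps each virtual register to
// the instruction indices that read it, sorted in DESCENDING order so
// stale entries can be dropped in O(1) each with pop_back().
using VRegReadTable = std::map<unsigned, std::vector<unsigned>>;

// Drop all recorded reads of 'vreg' at or before 'insnIndex'.  Because
// the vector is sorted descending, the stale entries sit at the back.
void retireReads(VRegReadTable &table, unsigned vreg, unsigned insnIndex) {
  auto it = table.find(vreg);
  if (it == table.end())
    return;
  std::vector<unsigned> &reads = it->second;
  while (!reads.empty() && reads.back() <= insnIndex)
    reads.pop_back();
}

// Next upcoming read of 'vreg'; returns ~0u when no reads remain.
unsigned nextRead(const VRegReadTable &table, unsigned vreg) {
  auto it = table.find(vreg);
  if (it == table.end() || it->second.empty())
    return ~0u;
  return it->second.back(); // smallest remaining instruction index
}
```

With the stale entries retired as each instruction is processed, any linear
scan chooseReg does over these vectors touches only future reads, so the
searches get cheaper as allocation advances through the block.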
chooseReg is of course where most of the time is spent, but truth be told,
98% of the time my application spends in LLVM is *not* in the register
allocator. Or I should say *was* not - a quick patch by Evan Cheng earlier
today to some scheduler code more than doubled LLVM's speed on large
functions. However, there are still some very serious LLVM performance
issues when codegenning large basic blocks. A couple of years ago,
I clocked the LLVM JIT at roughly 10,000 cycles per byte of code
generated. Currently, it's *well* over 200,000. (Again, this is on large
functions. On small functions, the performance is presumably much better.)
Of course, back then, LLVM didn't *have* a scheduler, and even things like
instruction selection were simpler. But somewhere, some nasty speed issues
have crept in when compiling huge functions. I'll do my best to try to sift
through these over the next few weeks. (Or at least provide Evan with plenty
of test cases and beg him to fix things.)
- PhysRegsUseOrder - you remove some elements from the middle of this
vector in removePhysReg. This is not a very efficient operation on a
vector, since it needs to copy the tail of the vector. I think using a
list data structure could be much more efficient for this purpose.
Actually, PhysRegsUseOrder isn't really needed for BigBlock - it's
a leftover from the local allocator. So there are even more brutal
performance fixes to be made to BigBlock; however, your suggestion
certainly applies to the Local allocator!
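For what it's worth, the vector-versus-list point above can be sketched like
this (a toy model, not the actual removePhysReg code): erasing from the
middle of a std::vector shifts the whole tail down, while unlinking a
std::list node copies nothing.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Vector version: std::remove compacts the surviving elements, which
// means every element after the erased one is copied -- O(n) per removal.
void removeFromVector(std::vector<unsigned> &order, unsigned physReg) {
  order.erase(std::remove(order.begin(), order.end(), physReg), order.end());
}

// List version: finding the element is still linear, but the unlink
// itself is O(1) and no elements are moved, which is what matters when
// removals from the middle of the use order are frequent.
void removeFromList(std::list<unsigned> &order, unsigned physReg) {
  order.remove(physReg);
}
```

Both produce the same sequence; the difference is purely in how much data
gets shuffled per removal, which adds up over thousands of instructions.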
I think these changes may significantly improve the performance of
your BigBlock register allocator. I'll try to come up with some more
concrete proposals or even patches over the weekend or next week.
Patches would be a dream - while I don't expect BigBlock to ever be
more than ~25% of total codegen time, every little bit helps.
Thanks for your comments,