Hi Vivek
Yes.
I do not know if there is a paper on this as this is quite trivial, but IIRC the Open64 register allocator does that.
Anyhow, the algo is:
Given a call graph SCC
- Allocate the function with no calls or where each callee has been allocated
- Propagate the clobbered registers to the callers of that function by updating the related regmasks on the callsites.
Repeat until there are no more candidates.
Right direction overall. The simplest approach to this is feasible within a summer and should definitely give you good results when you have cases of hot calls with many spills/fills around them that could be eliminated.
One does not necessarily need the call graph. The compiler can do this as an opportunistic optimization. The callee collects a resource mask and the caller consumes it when it is “there”. Within a module, when the callee (a “leaf”) is compiled before the caller, the information is “there”. When the call graph is available, you want a bottom-up walk for this optimization.
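The bottom-up scheme described above could be sketched roughly as follows. This is only a toy model under stated assumptions, not LLVM code: `Function`, `RegMask`, `kAllRegs` and `computeClobberSummaries` are hypothetical names, registers are modelled as bits in a 32-bit mask, and each function's own clobber set is taken as a given input rather than being produced by a real allocator. A caller whose callee has no summary (an external or recursive call) conservatively assumes every register is clobbered.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: one bit per physical register.
using RegMask = std::uint32_t;

struct Function {
    RegMask ownClobbers;               // registers the function body itself writes
    std::vector<std::string> callees;  // direct callees, by name
};

// Conservative answer when nothing is known about a callee.
constexpr RegMask kAllRegs = ~RegMask{0};

// Repeatedly picks functions whose callees all have summaries already
// (leaves first), and ORs the callee summaries into the caller's mask.
// Functions that never become ready (recursion, external callees) fall
// back to kAllRegs.
std::map<std::string, RegMask>
computeClobberSummaries(const std::map<std::string, Function>& module) {
    std::map<std::string, RegMask> summary;
    bool progress = true;
    while (progress) {
        progress = false;
        for (const auto& entry : module) {
            if (summary.count(entry.first)) continue;  // already done
            RegMask mask = entry.second.ownClobbers;
            bool ready = true;
            for (const auto& callee : entry.second.callees) {
                auto it = summary.find(callee);
                if (it == summary.end()) { ready = false; break; }
                mask |= it->second;  // propagate callee clobbers to the caller
            }
            if (ready) {
                summary[entry.first] = mask;
                progress = true;
            }
        }
    }
    // Whatever is left could not be ordered bottom-up: stay conservative.
    for (const auto& entry : module)
        summary.emplace(entry.first, kAllRegs);
    return summary;
}
```

In a real allocator the caller's own clobber set would itself depend on the callee masks, since a more precise mask lets the caller keep values live across the call; the sketch leaves that feedback out.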
A few things to keep an eye on:
- The twist here could be that the bottom-up order conflicts with the layout order, so the two optimizations would have to run independently. (I have not looked into the layout algorithm, so this might not be an actual issue here.)
Layout is just the order functions reach the AsmPrinter, so you’re right that this is going to make the function output different. If we care about the order, which we may do, then we’d need to cache the data in the AsmPrinter and reorder it there somehow.
Pete Cooper, do you mean to cache function-order-related data in AsmPrinter?
Yeah, exactly. So if the module has [foo, bar] in that order, but you compile them as [bar, foo] because of SCC, then you may want to somehow reorder them during the AsmPrinter so that they are emitted as [foo, bar] again.
Of course this shouldn’t matter (I can’t think of a case where it would matter), but for ease of debugging at least, it is nice to have functions emitted in the same order as they are in the IR.
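As a rough illustration of the reordering idea, the following sketch buffers the output of each function as it is compiled in SCC order and then emits the buffers in the original module order. It does not use the real AsmPrinter interface; `CompiledFunction` and `emitInModuleOrder` are hypothetical names, and the output of a function is simply modelled as a string.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the code generated for one function.
struct CompiledFunction {
    std::string name;
    std::string assembly;
};

// Takes the functions in the order they were compiled (e.g. bottom-up over
// the call graph) and returns their output in the order the module lists them.
std::vector<std::string>
emitInModuleOrder(const std::vector<std::string>& moduleOrder,
                  const std::vector<CompiledFunction>& compiledInSccOrder) {
    // Cache each function's output, keyed by name.
    std::map<std::string, std::string> cache;
    for (const auto& f : compiledInSccOrder)
        cache[f.name] = f.assembly;

    // Replay the cache in module order.
    std::vector<std::string> out;
    for (const auto& name : moduleOrder) {
        auto it = cache.find(name);
        if (it != cache.end())
            out.push_back(it->second);
    }
    return out;
}
```

The cost of this approach is that all function bodies are held in memory until the end, which may or may not be acceptable for large modules.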
Some bonus features that come from codegen on the call graph, and specifically having accurate regmasks and similar information:
- The X86 VZeroUpper pass should insert fewer VZeroUpper instructions before calls, and could possibly even learn that after the call the state of vzeroupper is known.
- Values in registers can be used by the callee instead of loading them.
The second one here is fun. Imagine this pseudo code:
foo:
  r0 = 1000
  …
  ret
bar:
  …
  call foo
  vreg1 = vreg2 + 1000
You know which registers contain which values after the call to foo. In this case you know that the value of 1000 is available in a register already, so you can avoid loading it for use in the add. You could have other values in registers too, even those which are passed in to foo. The ‘this’ pointer is the best example, as it’s very likely that r0 contains the this pointer after a function call which didn’t override r0 for the return.
The above-mentioned case is interesting and useful; perhaps a simple analysis pass which can return a map from value to register will help.
Yeah, I think it could be interesting. Of course one of the interesting things is deciding when it’s more profitable to not use the map. You would not, for example, choose to reserve a register containing a constant for a long time, as it would almost always be cheaper to just regenerate the constant when needed. But a constant used very soon after a call may still be useful.
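A minimal sketch of such a map, together with the profitability question, might look like the following. Everything here is an assumption made for illustration: `ReturnState`, `pickRegisterForConstant` and the `maxDistance` threshold are hypothetical, registers are plain integers, and "distance" is just a count of instructions since the call. A real heuristic would more likely look at register pressure and the cost of rematerializing the constant.

```cpp
#include <cstdint>
#include <map>

// Hypothetical summary of a callee: which constants are known to sit in
// which register when the callee returns (constant -> register number).
using ReturnState = std::map<std::int64_t, unsigned>;

// Returns the register that already holds `constant` after the call, or -1
// if the constant is not available or the use is too far from the call for
// keeping the register reserved to be worthwhile.
int pickRegisterForConstant(const ReturnState& afterCall,
                            std::int64_t constant,
                            unsigned instrsSinceCall,
                            unsigned maxDistance) {
    auto it = afterCall.find(constant);
    if (it == afterCall.end())
        return -1;  // the callee does not leave this value in a register
    if (instrsSinceCall > maxDistance)
        return -1;  // cheaper to regenerate the constant at the use
    return static_cast<int>(it->second);
}
```

For the pseudo code above, the summary for foo would map 1000 to r0, and the add right after the call would qualify for reuse.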
The this pointer example is actually related to what Quentin mentioned as a future direction here: rewriting calling conventions. If you have
int A::foo() {
  return this->value;
}
then you are going to have code something like
foo:
  r0 = load r0, #offset_of_value
  ret
If the this pointer is live after the call, and it almost certainly is, then it would be better to rewrite this call to avoid clobbering r0. That is, return the this pointer in r0 and the value in r1. That could actually be done as an IR-level pass too, though, if it’s deemed profitable.
Anyway, didn’t mean to distract from the immediate goals of this project. I’m excited to see the SCC code make it in tree and see what else it enables.
One more, just for fun: inter-procedural stack allocation. That is, foo calls bar, and bar needs 4 bytes of stack space. Instead of bar allocating 4 bytes, it adds an attribute to itself, then foo allocates 4 bytes of space at the bottom of the stack for bar to use.
Can you please provide some links to understand the benefits of IP stack allocation?
I actually don’t have any links. It’s just something I thought about implementing a while ago. The main benefits I can think of are saving code size and performance, as ‘bar’ in my example would not contain any stack manipulation code.
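To make the foo/bar example concrete, here is a small sketch of how a caller's frame size could be computed under this idea. It is purely illustrative: `FrameInfo`, `wantsCallerProvidedStack` and `callerFrameBytes` are hypothetical names, the attribute is modelled as a boolean, and the sketch assumes that marked callees are leaves and that only one callee is active at a time, so the caller reserves the maximum rather than the sum of their needs.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct FrameInfo {
    unsigned localBytes;               // stack bytes the function needs for itself
    bool wantsCallerProvidedStack;     // hypothetical attribute set by the callee
    std::vector<std::string> callees;  // direct callees, by name
};

// Frame size for `caller`: its own locals plus enough extra space at the
// bottom of the frame for the largest callee that asked the caller to
// provide its stack. Unmarked or unknown callees allocate for themselves
// and add nothing here.
unsigned callerFrameBytes(const std::string& caller,
                          const std::map<std::string, FrameInfo>& fns) {
    const FrameInfo& info = fns.at(caller);
    unsigned reserved = 0;
    for (const auto& calleeName : info.callees) {
        auto it = fns.find(calleeName);
        if (it != fns.end() && it->second.wantsCallerProvidedStack)
            reserved = std::max(reserved, it->second.localBytes);
    }
    return info.localBytes + reserved;
}
```

With foo needing 8 bytes of its own and bar marked as needing 4, foo's frame becomes 12 bytes and bar no longer adjusts the stack pointer itself.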
I have also written the draft proposal; I will share it through the Summer of Code site.
Here is the link https://docs.google.com/document/d/1DrsaFJdtxV73Zpns2bEgjATLFcWuaYMPHuvt5THLeLk/edit?usp=sharing
This is not very effective, but I would still like to give it a try. Please review it quickly; I have 23 hours to submit the final PDF.
I just read it. It looks good to me, although I’m not a register allocator or SCC expert, so hopefully others will have good feedback for you.
Thanks
Pete