First, thanks very much for your comments!
It is limited to X86 presently since that is the only target I have
access to at the moment.
What part of this is target dependent? Is this due to emitPrologue /
emitEpilogue being target specific?
It is target dependent (X86) at present because of the way I developed it, just using the X86 target since that is the only one on which I can test the entire (static) flow: test.c → llvm-gcc -emit-llvm → (.ll, .bc) → llc --shrink-wrap → .s → gcc test.s -o test.
I worked with other targets also, but I decided to take it as far as I could on the first go with X86.
First pass was without debugging info and with simple stack frames, in the interest of getting as much worked out as possible.
I saw the issue concerning how code gen handles placement of spill and restore code outside of entry/return blocks before I had the first test cases running, but I worked through the details using -march=x86 only.
Re: debugging info: I know about the work to change the way debugging info is handled, so I held off trying to make the shrink wrapping work with the current impl.
The main features are:
-
Placing callee saved register (CSR) spills and restores in the
CFG to tightly surround uses
so that execution paths that do not use CSRs do not pay the
spill/restore penalty.
-
Avoiding placment of spills/restores in loops: if a CSR is used
inside a loop(nest), the spills
are placed in the loop preheader, and restores are placed in
the loop exit nodes (the
successors of the loop exiting nodes).
-
Covering paths without CSR uses: e.g. if a restore is placed in
a join block, a matching spill
is added to the end of all immediate predecessor blocks that
are not reached by a spill.
Similarly for saves placed in branch blocks.
Sounds great. It would help everyone if you can show some examples code.
I am putting documented examples together from the test cases in the patch.
Since I ran into a non-trivial issue in developing this pass, I
would like to submit my implementation as a “RFC + code/tests”
rather than a typical contribution, and get people’s opinions on how
to proceed.
The issue is that the code generator assumes all spills and restores
of callee saved registers must be placed in the entry and return
blocks resp.
Shink wrapping violates this assumption, with the result that the
frame offsets computed by PEI for frame-relative operands may be
incorrect.
I am not sure how this would happen. Why would frame offsets be
affected by where these instructions are placed?
The issue is illustrated by a simple example in which a single CSR is used in one branch of a conditional. When the stack frame is laid out, the spill for this CSR is accounted for in the calculation of stack size as it should be. The stack slot for the CSR is correctly computed and everything seems fine when the MO_FrameIndex are replaced. The problem is that since the spill instruction for the CSR (e.g. pushl %esi) is moved from the entry block, the push does not happen, and the value of %esp in the entry block is not what it should be to satisfy the offsets produced by eliminateFrameIndex().
A similar situation exists for the BB into which a spill is “moved” (from the entry block): a push happens to spill the CSR on entry to the block, and now %esp is not what it should be for that block. The example below illustrates this issue:
assume:
int F(int a, int b, int c) uses one CSR in conditional branch
prologue, no shrink wrapping:
_F:
pushl %esi # spill CSR %csi, %esp -= 4 (in this case)
subl $56, %esp # create frame, %esp = %esp - 56
movl 64(%esp), %eax # fetch arg ‘a’ from entry %esp + 4
movl %eax, 52(%esp)
movl 68(%esp), %eax # fetch arg ‘b’
movl %eax, 48(%esp)
…
prologue with spill shrink-wrapped to basic block bb:
_F:
no spill of %esi, moved to bb
subl $56, %esp # create frame same as before %esp = %esp - 56
movl 64(%esp), %eax # error: ‘a’ is not at 64(%esp), it’s at 60(%esp)
movl %eax, 52(%esp)
…
The simple, ugly hack of adjusting the value by which %esp is decremented in the prologue when one or more CSR spills have been placed into other blocks takes care of the issue on this simple code (no dynamic stack realign., nor EH) on x86.
The companion hack for (non entry) MBBs into which spills have been introduced is to adjust the stack size around eliminateFrameIndex()'s for replacement of MO_FrameIndex operands.
Obviously, all of this applies only when spills are done with push/pop, which is the case on x86. I used this issue to start looking at generalizing how spills and restores are handled, before looking too closely at other targets, and developed the workaround for the initial implementation.
My limited implementation uses a workaround that adjusts the
generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at
several approaches, but I would like input from anyone who
has an opinion.
I think to do this right for every target is a big job. I’d like to
see prologue / epilogue related stuff be moved out of
TargetRegisterInfo. Shrink wrapping will only happen when the targets
buy-in, i.e. providing the right hooks.
Part of what I’m doing now is estimating the work, which requires going through the targets. I am not far enough along to send out a proposal. Moving pro/epi generation out of TRI, perhaps into its own “component” is one architecture I am looking at.
When is shrink wrapping happening? Is it possible to do it after CSR
spills and restores are inserted but before FI are lowered into sp /
fp +/- offset?
Shrink wrapping starts after calculateCalleeSavedRegisters(), which creates the list of CSRs used in the function. Shrink wrapping assigns MBB placements for spills and restores based on where they are used. calculateCalleeSavedRegisters() determines stack slots for the CSRs used in the function.
I don’t see an interaction between this and shrink wrapping, of have I missed something?
Finally, I realize that shrink wrapping is probably not high
priority in the larger scheme of LLVM development, so I’m not
expecting
a huge response, but any ideas are certainly welcome.
It’s actually a fairly useful optimization. It can really help a class
of functions, e.g. functions with early returns.
Quite right, it is certainly worthwhile. I could have left that comment out
The patch and a test tarball are attached. I include basic tests
that are run with the supplied Makefile.
Some comments:
- The code needs some refactoring. It’s a big chunk of code so
it’s hard to digest.
- There doesn’t seem to be a way to turn off shrink wrapping. Please
make sure it’s optional. When it’s not being done, PEI should not
require dominator, etc.
I already refactored once, but I knew it would not be enough(!), I’ll definitely do another pass. I forgot to put the analysis deps under --shrink-wrap, I will fix that and anything else that I might have left out of the option.
From what I can see this is a very good first step. I look forward to
seeing its completion.
Evan
Thanks! Likewise, and it’s a pleasure to work on.
John