Shrink Wrapping - RFC and initial implementation

Hello LLVMdev,

I have been working with LLVM for just over a year now, mainly in the area of compilation for HDLs like SystemVerilog and SystemC.

Most of this work dealt with translation to LLVM IR, representing concurrent languages with LLVM and using LLVM analyses and transforms
for compiling onto proprietary simulation acceleration hardware. All of this work used the C back end exclusively, since I wanted a transparent
and easily debuggable flow.

To learn more about the code generator, I decided to try implementing shrink wrapping, a reasonably self-contained back end transformation pass.

I now have a preliminary implemenation of shrink wrapping, done as an option to prologue/epilogue insertion under the switch --shrink-wrap.
It is limited to X86 presently since that is the only target I have access to at the moment.

The main features are:

  • Placing callee saved register (CSR) spills and restores in the CFG to tightly surround uses
    so that execution paths that do not use CSRs do not pay the spill/restore penalty.

  • Avoiding placment of spills/restores in loops: if a CSR is used inside a loop(nest), the spills
    are placed in the loop preheader, and restores are placed in the loop exit nodes (the
    successors of the loop exiting nodes).

  • Covering paths without CSR uses: e.g. if a restore is placed in a join block, a matching spill
    is added to the end of all immediate predecessor blocks that are not reached by a spill.
    Similarly for saves placed in branch blocks.

Since I ran into a non-trivial issue in developing this pass, I would like to submit my implementation as a “RFC + code/tests”
rather than a typical contribution, and get people’s opinions on how to proceed.

The issue is that the code generator assumes all spills and restores of callee saved registers must be placed in the entry and return blocks resp.
Shink wrapping violates this assumption, with the result that the frame offsets computed by PEI for frame-relative operands may be incorrect.

My limited implementation uses a workaround that adjusts the generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at several approaches, but I would like input from anyone who
has an opinion.

Finally, I realize that shrink wrapping is probably not high priority in the larger scheme of LLVM development, so I’m not expecting
a huge response, but any ideas are certainly welcome.

The patch and a test tarball are attached. I include basic tests that are run with the supplied Makefile.

Thanks,
John Mosby

shrink-wrapping.patch (42.2 KB)

test-sw.tar.gz (2.22 KB)

Hello, John

My limited implementation uses a workaround that adjusts the
generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at
several approaches, but I would like input from anyone who
has an opinion.

I haven't looked into the patch deep enough yet, but I have at least 2 questions:
1. How do all the stuff play with dynamic stack realignment?
2. It seems, that dwarf information about callee-saved registers is invalidated by your patch.
This means, that you won't have sane stack traces in the debugger. Unwinding won't also work.
Have you tried to compile some C++ code, which uses EH?

Hi Anton,

Thanks for your questions, that’s what I’m looking for.

Hello, John

My limited implementation uses a workaround that adjusts the
generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at
several approaches, but I would like input from anyone who
has an opinion.

I haven’t looked into the patch deep enough yet, but I have at least 2 questions:

  1. How do all the stuff play with dynamic stack realignment?
  2. It seems, that dwarf information about callee-saved registers is invalidated by your patch.
    This means, that you won’t have sane stack traces in the debugger. Unwinding won’t also work.
    Have you tried to compile some C++ code, which uses EH?

Integrating shrink wrapping with dynamic stack realignment, debugging info, EH (and more)

requires a more general (or more complete) way of treating callee-saved registers, and I did
not attempt to tackle this in the patch. I meant to show a starting point for this work and get
some questions coming in (working so far :-)).

I am not far along enough with the two approaches I’m looking at to give more detail. I will
have more worked out in a few days.

I’m still coming up to speed in the code generator areas, so thanks for your patience!

Again, thanks for your questions,
John

Hello LLVMdev,

I have been working with LLVM for just over a year now, mainly in the area of compilation for HDLs like SystemVerilog and SystemC.
Most of this work dealt with translation to LLVM IR, representing concurrent languages with LLVM and using LLVM analyses and transforms
for compiling onto proprietary simulation acceleration hardware. All of this work used the C back end exclusively, since I wanted a transparent
and easily debuggable flow.

Welcome to the community.

To learn more about the code generator, I decided to try implementing shrink wrapping, a reasonably self-contained back end transformation pass.

I now have a preliminary implemenation of shrink wrapping, done as an option to prologue/epilogue insertion under the switch --shrink-wrap.

Nice.

It is limited to X86 presently since that is the only target I have access to at the moment.

What part of this is target dependent? Is this due to emitPrologue / emitEpilogue being target specific?

The main features are:
  - Placing callee saved register (CSR) spills and restores in the CFG to tightly surround uses
     so that execution paths that do not use CSRs do not pay the spill/restore penalty.

  - Avoiding placment of spills/restores in loops: if a CSR is used inside a loop(nest), the spills
     are placed in the loop preheader, and restores are placed in the loop exit nodes (the
     successors of the loop _exiting_ nodes).

  - Covering paths without CSR uses: e.g. if a restore is placed in a join block, a matching spill
     is added to the end of all immediate predecessor blocks that are not reached by a spill.
     Similarly for saves placed in branch blocks.

Sounds great. It would help everyone if you can show some examples code.

Since I ran into a non-trivial issue in developing this pass, I would like to submit my implementation as a "RFC + code/tests"
rather than a typical contribution, and get people's opinions on how to proceed.

The issue is that the code generator assumes all spills and restores of callee saved registers must be placed in the entry and return blocks resp.
Shink wrapping violates this assumption, with the result that the frame offsets computed by PEI for frame-relative operands may be incorrect.

I am not sure how this would happen. Why would frame offsets be affected by where these instructions are placed?

My limited implementation uses a workaround that adjusts the generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at several approaches, but I would like input from anyone who
has an opinion.

I think to do this right for every target is a big job. I'd like to see prologue / epilogue related stuff be moved out of TargetRegisterInfo. Shrink wrapping will only happen when the targets buy-in, i.e. providing the right hooks.

When is shrink wrapping happening? Is it possible to do it after CSR spills and restores are inserted but before FI are lowered into sp / fp +/- offset?

Finally, I realize that shrink wrapping is probably not high priority in the larger scheme of LLVM development, so I'm not expecting
a huge response, but any ideas are certainly welcome.

It's actually a fairly useful optimization. It can really help a class of functions, e.g. functions with early returns.

The patch and a test tarball are attached. I include basic tests that are run with the supplied Makefile.

Some comments:

1. The code needs some refactoring. :slight_smile: It's a big chunk of code so it's hard to digest.
2. There doesn't seem to be a way to turn off shrink wrapping. Please make sure it's optional. When it's not being done, PEI should not require dominator, etc.

From what I can see this is a very good first step. I look forward to seeing its completion.

Evan

Hi Anton,

Thanks for your questions, that’s what I’m looking for.

Hello, John

My limited implementation uses a workaround that adjusts the
generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at
several approaches, but I would like input from anyone who
has an opinion.

I haven’t looked into the patch deep enough yet, but I have at least 2 questions:

  1. How do all the stuff play with dynamic stack realignment?
  2. It seems, that dwarf information about callee-saved registers is invalidated by your patch.
    This means, that you won’t have sane stack traces in the debugger. Unwinding won’t also work.
    Have you tried to compile some C++ code, which uses EH?

Integrating shrink wrapping with dynamic stack realignment, debugging info, EH (and more)

requires a more general (or more complete) way of treating callee-saved registers, and I did
not attempt to tackle this in the patch. I meant to show a starting point for this work and get
some questions coming in (working so far :-)).

I think for step 1 PEI should not attempt shrink wrapping when dynamic stack realignment or EH is required. Debug info is a different story. We are in the process of eliminating debug specific instructions (i.e. rely completely on DebugLoc on machine instructions). Once that’s done, it should just work.

Evan

First, thanks very much for your comments!

It is limited to X86 presently since that is the only target I have
access to at the moment.

What part of this is target dependent? Is this due to emitPrologue /
emitEpilogue being target specific?

It is target dependent (X86) at present because of the way I developed it, just using the X86 target since that is the only one on which I can test the entire (static) flow: test.c → llvm-gcc -emit-llvm → (.ll, .bc) → llc --shrink-wrap → .s → gcc test.s -o test.

I worked with other targets also, but I decided to take it as far as I could on the first go with X86.

First pass was without debugging info and with simple stack frames, in the interest of getting as much worked out as possible.
I saw the issue concerning how code gen handles placement of spill and restore code outside of entry/return blocks before I had the first test cases running, but I worked through the details using -march=x86 only.

Re: debugging info: I know about the work to change the way debugging info is handled, so I held off trying to make the shrink wrapping work with the current impl.

The main features are:

  • Placing callee saved register (CSR) spills and restores in the
    CFG to tightly surround uses
    so that execution paths that do not use CSRs do not pay the
    spill/restore penalty.

  • Avoiding placment of spills/restores in loops: if a CSR is used
    inside a loop(nest), the spills
    are placed in the loop preheader, and restores are placed in
    the loop exit nodes (the
    successors of the loop exiting nodes).

  • Covering paths without CSR uses: e.g. if a restore is placed in
    a join block, a matching spill
    is added to the end of all immediate predecessor blocks that
    are not reached by a spill.
    Similarly for saves placed in branch blocks.

Sounds great. It would help everyone if you can show some examples code.

I am putting documented examples together from the test cases in the patch.

Since I ran into a non-trivial issue in developing this pass, I
would like to submit my implementation as a “RFC + code/tests”
rather than a typical contribution, and get people’s opinions on how
to proceed.

The issue is that the code generator assumes all spills and restores
of callee saved registers must be placed in the entry and return
blocks resp.
Shink wrapping violates this assumption, with the result that the
frame offsets computed by PEI for frame-relative operands may be
incorrect.

I am not sure how this would happen. Why would frame offsets be
affected by where these instructions are placed?

The issue is illustrated by a simple example in which a single CSR is used in one branch of a conditional. When the stack frame is laid out, the spill for this CSR is accounted for in the calculation of stack size as it should be. The stack slot for the CSR is correctly computed and everything seems fine when the MO_FrameIndex are replaced. The problem is that since the spill instruction for the CSR (e.g. pushl %esi) is moved from the entry block, the push does not happen, and the value of %esp in the entry block is not what it should be to satisfy the offsets produced by eliminateFrameIndex().
A similar situation exists for the BB into which a spill is “moved” (from the entry block): a push happens to spill the CSR on entry to the block, and now %esp is not what it should be for that block. The example below illustrates this issue:

assume:
int F(int a, int b, int c) uses one CSR in conditional branch

prologue, no shrink wrapping:

_F:
pushl %esi # spill CSR %csi, %esp -= 4 (in this case)
subl $56, %esp # create frame, %esp = %esp - 56
movl 64(%esp), %eax # fetch arg ‘a’ from entry %esp + 4
movl %eax, 52(%esp)
movl 68(%esp), %eax # fetch arg ‘b’

movl %eax, 48(%esp)

prologue with spill shrink-wrapped to basic block bb:

_F:

no spill of %esi, moved to bb

subl $56, %esp # create frame same as before %esp = %esp - 56
movl 64(%esp), %eax # error: ‘a’ is not at 64(%esp), it’s at 60(%esp)
movl %eax, 52(%esp)

The simple, ugly hack of adjusting the value by which %esp is decremented in the prologue when one or more CSR spills have been placed into other blocks takes care of the issue on this simple code (no dynamic stack realign., nor EH) on x86.

The companion hack for (non entry) MBBs into which spills have been introduced is to adjust the stack size around eliminateFrameIndex()'s for replacement of MO_FrameIndex operands.

Obviously, all of this applies only when spills are done with push/pop, which is the case on x86. I used this issue to start looking at generalizing how spills and restores are handled, before looking too closely at other targets, and developed the workaround for the initial implementation.

My limited implementation uses a workaround that adjusts the
generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at
several approaches, but I would like input from anyone who
has an opinion.

I think to do this right for every target is a big job. I’d like to
see prologue / epilogue related stuff be moved out of
TargetRegisterInfo. Shrink wrapping will only happen when the targets
buy-in, i.e. providing the right hooks.

Part of what I’m doing now is estimating the work, which requires going through the targets. I am not far enough along to send out a proposal. Moving pro/epi generation out of TRI, perhaps into its own “component” is one architecture I am looking at.

When is shrink wrapping happening? Is it possible to do it after CSR
spills and restores are inserted but before FI are lowered into sp /
fp +/- offset?

Shrink wrapping starts after calculateCalleeSavedRegisters(), which creates the list of CSRs used in the function. Shrink wrapping assigns MBB placements for spills and restores based on where they are used. calculateCalleeSavedRegisters() determines stack slots for the CSRs used in the function.
I don’t see an interaction between this and shrink wrapping, of have I missed something?

Finally, I realize that shrink wrapping is probably not high
priority in the larger scheme of LLVM development, so I’m not
expecting
a huge response, but any ideas are certainly welcome.

It’s actually a fairly useful optimization. It can really help a class
of functions, e.g. functions with early returns.

Quite right, it is certainly worthwhile. I could have left that comment out :slight_smile:

The patch and a test tarball are attached. I include basic tests
that are run with the supplied Makefile.

Some comments:

  1. The code needs some refactoring. :slight_smile: It’s a big chunk of code so
    it’s hard to digest.
  2. There doesn’t seem to be a way to turn off shrink wrapping. Please
    make sure it’s optional. When it’s not being done, PEI should not
    require dominator, etc.

I already refactored once, but I knew it would not be enough(!), I’ll definitely do another pass. I forgot to put the analysis deps under --shrink-wrap, I will fix that and anything else that I might have left out of the option.

From what I can see this is a very good first step. I look forward to
seeing its completion.

Evan

Thanks! Likewise, and it’s a pleasure to work on.

John

First, thanks very much for your comments!

It is limited to X86 presently since that is the only target I have
access to at the moment.

What part of this is target dependent? Is this due to emitPrologue /
emitEpilogue being target specific?

It is target dependent (X86) at present because of the way I developed it, just using the X86 target since that is the only one on which I can test the entire (static) flow: test.c → llvm-gcc -emit-llvm → (.ll, .bc) → llc --shrink-wrap → .s → gcc test.s -o test.

I worked with other targets also, but I decided to take it as far as I could on the first go with X86.

First pass was without debugging info and with simple stack frames, in the interest of getting as much worked out as possible.
I saw the issue concerning how code gen handles placement of spill and restore code outside of entry/return blocks before I had the first test cases running, but I worked through the details using -march=x86 only.

Ok.

Re: debugging info: I know about the work to change the way debugging info is handled, so I held off trying to make the shrink wrapping work with the current impl.

The main features are:

  • Placing callee saved register (CSR) spills and restores in the
    CFG to tightly surround uses
    so that execution paths that do not use CSRs do not pay the
    spill/restore penalty.

  • Avoiding placment of spills/restores in loops: if a CSR is used
    inside a loop(nest), the spills
    are placed in the loop preheader, and restores are placed in
    the loop exit nodes (the
    successors of the loop exiting nodes).

  • Covering paths without CSR uses: e.g. if a restore is placed in
    a join block, a matching spill
    is added to the end of all immediate predecessor blocks that
    are not reached by a spill.
    Similarly for saves placed in branch blocks.

Sounds great. It would help everyone if you can show some examples code.

I am putting documented examples together from the test cases in the patch.

Yes, please. I think that will help to encourage discussion.

Since I ran into a non-trivial issue in developing this pass, I
would like to submit my implementation as a “RFC + code/tests”
rather than a typical contribution, and get people’s opinions on how
to proceed.

The issue is that the code generator assumes all spills and restores
of callee saved registers must be placed in the entry and return
blocks resp.
Shink wrapping violates this assumption, with the result that the
frame offsets computed by PEI for frame-relative operands may be
incorrect.

I am not sure how this would happen. Why would frame offsets be
affected by where these instructions are placed?

The issue is illustrated by a simple example in which a single CSR is used in one branch of a conditional. When the stack frame is laid out, the spill for this CSR is accounted for in the calculation of stack size as it should be. The stack slot for the CSR is correctly computed and everything seems fine when the MO_FrameIndex are replaced. The problem is that since the spill instruction for the CSR (e.g. pushl %esi) is moved from the entry block, the push does not happen, and the value of %esp in the entry block is not what it should be to satisfy the offsets produced by eliminateFrameIndex().
A similar situation exists for the BB into which a spill is “moved” (from the entry block): a push happens to spill the CSR on entry to the block, and now %esp is not what it should be for that block. The example below illustrates this issue:

Ok. Then PEI should use store / load instead of push and pop and esp adjustment should be adjusted accordingly.

assume:
int F(int a, int b, int c) uses one CSR in conditional branch

prologue, no shrink wrapping:

_F:
pushl %esi # spill CSR %csi, %esp -= 4 (in this case)
subl $56, %esp # create frame, %esp = %esp - 56
movl 64(%esp), %eax # fetch arg ‘a’ from entry %esp + 4
movl %eax, 52(%esp)
movl 68(%esp), %eax # fetch arg ‘b’

movl %eax, 48(%esp)

prologue with spill shrink-wrapped to basic block bb:

_F:

no spill of %esi, moved to bb

subl $56, %esp # create frame same as before %esp = %esp - 56
movl 64(%esp), %eax # error: ‘a’ is not at 64(%esp), it’s at 60(%esp)
movl %eax, 52(%esp)

The simple, ugly hack of adjusting the value by which %esp is decremented in the prologue when one or more CSR spills have been placed into other blocks takes care of the issue on this simple code (no dynamic stack realign., nor EH) on x86.

The companion hack for (non entry) MBBs into which spills have been introduced is to adjust the stack size around eliminateFrameIndex()'s for replacement of MO_FrameIndex operands.

Obviously, all of this applies only when spills are done with push/pop, which is the case on x86. I used this issue to start looking at generalizing how spills and restores are handled, before looking too closely at other targets, and developed the workaround for the initial implementation.

Spills don’t have to be done with push and pop. One possible approach to take is to perform them with store and load. Then later on PEI will optimize (some of) them into push and pop.

My limited implementation uses a workaround that adjusts the
generation of prologue code and the frame indices used by
the target eliminateFrameIndex() when necessary. I am looking at
several approaches, but I would like input from anyone who
has an opinion.

I think to do this right for every target is a big job. I’d like to
see prologue / epilogue related stuff be moved out of
TargetRegisterInfo. Shrink wrapping will only happen when the targets
buy-in, i.e. providing the right hooks.

Part of what I’m doing now is estimating the work, which requires going through the targets. I am not far enough along to send out a proposal. Moving pro/epi generation out of TRI, perhaps into its own “component” is one architecture I am looking at.

It’s not necessary to update all the targets at once. We can ask targets to opt in to take advantage of shrink wrapping.

When is shrink wrapping happening? Is it possible to do it after CSR
spills and restores are inserted but before FI are lowered into sp /
fp +/- offset?

Shrink wrapping starts after calculateCalleeSavedRegisters(), which creates the list of CSRs used in the function. Shrink wrapping assigns MBB placements for spills and restores based on where they are used. calculateCalleeSavedRegisters() determines stack slots for the CSRs used in the function.
I don’t see an interaction between this and shrink wrapping, of have I missed something?

No.

Finally, I realize that shrink wrapping is probably not high
priority in the larger scheme of LLVM development, so I’m not
expecting
a huge response, but any ideas are certainly welcome.

It’s actually a fairly useful optimization. It can really help a class
of functions, e.g. functions with early returns.

Quite right, it is certainly worthwhile. I could have left that comment out :slight_smile:

The patch and a test tarball are attached. I include basic tests
that are run with the supplied Makefile.

Some comments:

  1. The code needs some refactoring. :slight_smile: It’s a big chunk of code so
    it’s hard to digest.
  2. There doesn’t seem to be a way to turn off shrink wrapping. Please
    make sure it’s optional. When it’s not being done, PEI should not
    require dominator, etc.

I already refactored once, but I knew it would not be enough(!), I’ll definitely do another pass. I forgot to put the analysis deps under --shrink-wrap, I will fix that and anything else that I might have left out of the option.

Ok, thanks.

From what I can see this is a very good first step. I look forward to
seeing its completion.

Evan

Thanks! Likewise, and it’s a pleasure to work on.

Good to hear.

Thanks,

Evan

Obviously, all of this applies only when spills are done with push/pop, which is the case on x86. I used this issue to start looking at generalizing how spills and restores are handled, before looking too closely at other targets, and developed the workaround for the initial implementation.

Spills don’t have to be done with push and pop. One possible approach to take is to perform them with store and load. Then later on PEI will optimize (some of) them into push and pop.

I am rewriting to use stores/loads via storeRegToStackSlot(), loadRegFromStackSlot(), which will obviate the stack adjustment. I did it this way in an early version, but the architecture is to give the target a chance to generate the spill(restore) code itself, so I conformed to that.
As you point out, these can selectively be replaced with push/pop later.

Part of what I’m doing now is estimating the work, which requires going through the targets. I am not far enough along to send out a proposal. Moving pro/epi generation out of TRI, perhaps into its own “component” is one architecture I am looking at.

It’s not necessary to update all the targets at once. We can ask targets to opt in to take advantage of shrink wrapping.

That sounds good. I now have a better handle on the differences between the targets: XCore handles frame moves in its spillCalleeSavedRegister(), the ARM code for this is very straightforward.

I hope to have an updated patch done tomorrow, with doc. and explicated examples.

Thanks,
John

Here is an updated patch for shrink wrapping with:

  • spills/restores done with stack slot stores/loads
  • stack adjustment removed
  • refactoring (but still in need of more)
  • spill/restore insertion code unified with spill/restore placement code

Documentation available here illustrates shrink wrapping with loops and discusses a situation in which the pass would be more effective by splitting edges in the Machine CFG (similar to breaking crit. edges in the CFG).

Test cases are attached.

Thanks,
John

shrink-wrapping.P2.patch (39.6 KB)

test-sw.tar.gz (2.22 KB)

Hi John,

It looks pretty good. Thanks for working on this. Some comments:

  1. Some of the functions that you introduced, e.g. stringifyCSRegSet probably ought to be “static” and ifdef’ed out when NDEBUG is defined.

    • // DEBUG
  • if (! MBB->empty() && ! CSRUsed[MBB].intersects(restore)) {
  • MachineInstr* MI = BeforeI;
  • DOUT << "adding restore after ";
  • DEBUG(MI->dump());
  • } else {
  • DOUT << "adding restore to beginning of "
  • << getBasicBlockName(MBB) << “\n”;
  • }
  • // DEBUG

Code like this should also be inside ifndef NDEBUG.

  1. It can still use more refactoring. :slight_smile:

  2. clearSets().
    It’s not clear what sets it’s clearing. Perhaps name it something like clearShrinkWrapData()?

5
+void PEI::placeSpillsAndRestores(MachineFunction &Fn) {

Hi Evan,

Thanks very much for the review, I am implementing your suggestions today and will have the next patch together this weekend.

A few questions/comments:

  1. Some of the functions that you introduced, e.g. stringifyCSRegSet probably ought to be “static” and ifdef’ed out when NDEBUG is defined.

    • // DEBUG
  • if (! MBB->empty() && ! CSRUsed[MBB].intersects(restore)) {
  • MachineInstr* MI = BeforeI;
  • DOUT << "adding restore after ";
  • DEBUG(MI->dump());
  • } else {
  • DOUT << "adding restore to beginning of "
  • << getBasicBlockName(MBB) << “\n”;
  • }
  • // DEBUG

Code like this should also be inside ifndef NDEBUG.

  1. It can still use more refactoring. :slight_smile:

  2. clearSets().
    It’s not clear what sets it’s clearing. Perhaps name it something like clearShrinkWrapData()?

Agreed in toto, I will refactor further and try to get all remaining cleanups into the next patch.

5
+void PEI::placeSpillsAndRestores(MachineFunction &Fn) {

The shrink wrap version probably should go to its own function. Otherwise, it should exit early when the non-shrink wrapping version is done. That reduces nesting and please those of us who are a bit picky.

I originally wrote it this way (separate functions). I then refactored the code to unify the two behaviors around the idea of placing spills/restores. I think the versions need to be separate for the reasons you state but the placement idea should be retained. I will reintroduce the separation.

  • // Save entry block, return blocks.
  • if (MBB->pred_size() == 0)
  • entryBlock = MBB;

Entry block is the Fn.front().

  • if (!MBB->empty() && MBB->back().getDesc().isReturn())
  • returnBlocks.push_back(MBB);

PEI::insertPrologEpilogCode also traverse MBBs and get at return blocks. So these probably ought to be shared, i.e. returnBlocks should be an ivar and determined early in PEI.

Absolutely (I knew Fn.front() was what I wanted but didn’t go back and fix things). I am implementing these.

  • for (MachineBasicBlock::iterator I = MBB->begin(); I != MBB->end(); ++I) {
  • for (unsigned i = 0, e = CSI.size(); i != e; ++i) {
  • unsigned Reg = CSI[i].getReg();
  • // If instruction I reads or modifies Reg, add it to UsedCSRegs,
  • // CSRUsed map for the current block.
  • if (I->readsRegister(Reg, TRI) || I->modifiesRegister(Reg, TRI)) {

readsRegister and modifiesRegister both scan the operands. It’s probably better to manually walk through the operands.

Ok, that helps. I read the code but could not decide which way to do it at first.

  • // If all uses of CSRegs are in the entry block, there is nothing
  • // to shrink wrap: spills go in entry block, restores go in exiting
  • // blocks, turn off shrink wrapping.
  • if (allCSRUsesInEntryBlock) {
  • ShrinkWrapping = false;
  • DOUT << “All uses of CSRegs are in entry block, nothing to do.\n”;
  • }
  • // If we have decided not to shrink wrap, just return now.
  • if (! ShrinkWrapping)
  • return true;

Why not just return inside if (allCSRUsesInEntryBlock)?

ARGHHH, I thought I simplified that before cutting the patch.

+bool PEI::calculateUsedAnticAvail(MachineFunction &Fn) {

  • // Calculate AnticIn, AnticOut using post-order traversal of MCFG.

  • for (po_iterator<MachineBasicBlock*>

  • MBBI = po_begin(Fn.getBlockNumbered(0)),

  • MBBE = po_end(Fn.getBlockNumbered(0)); MBBI != MBBE; ++MBBI) {

  • MachineBasicBlock* MBB = *MBBI;

  • // Calculate Avail{In,Out} via top-down walk of Machine dominator tree.

  • for (df_iterator<MachineDomTreeNode*> DI = df_begin(DT.getRootNode()),

  • E = df_end(DT.getRootNode()); DI != E; ++DI) {

Later in

+/// placeSpillsAndRestores - decide which MBBs need spills, restores
+/// of CSRs.
+///
+void PEI::placeSpillsAndRestores(MachineFunction &Fn) {

  • // Calculate CSRRestore using post-order traversal of Machine-CFG.
  • for (po_iterator<MachineBasicBlock*>
  • MBBI = po_begin(Fn.getBlockNumbered(0)),
  • MBBE = po_end(Fn.getBlockNumbered(0)); MBBI != MBBE; ++MBBI) {

This seem to be doing traversal at least one too many times? Can this be improved?

I started to reduce the traversals, then decided to work on edge splitting because I believe it may be needed to finish shrink wrapping.

I will return to that work and see if I can reduce the traversals, which for this approach (computing Antic, Avail) will decrease the constant factor in the runtime bound, which is linear in the size of the Machine IR.

  1. Can you explain a bit more about AnticIn, AvailIn, etc.?

I am working on a document, currently hosted at github, which will present the details of the implementation, examples, etc.

I looked at two approaches to determine spill/restore placements:

  1. Try to use live intervals of CSRs that might be available when PEI runs.
    The idea here is that each CSR used in a function will have one or more
    defs which dominate one or more uses. Live intervals might lead me to
    the MBBs in which to place spills/restores.

  2. Use “anticipatibility” (Antic{In,Out} sets) to find the points from which all
    outgoing paths contain defs or uses of a CSR, and use “availability”
    (Avail{In,Out} sets) to find the points such that all incoming paths contain
    defs or uses of a CSR. We place a spill for a CSR at the earliest point
    leading to a sequence of uses (a contiguous set of blocks containing uses),
    so a block B will get a spill for CSR R if R is anticipatable at B and not
    anticipatable at any predecessor of B. If R is used and redefined in a block,
    we have to avoid placing another spill in that block, (it was spilled earlier),
    so in addition to the above condition, R must not be available at B.
    Determining restore placement is the mirror image of spill placement.

I went with approach 2 despite the apparent complexity because the data flow
info is actually straightforward to compute, and I did not have to first synthesize
LiveIntervals (read a ton of code) to get the pass working. I am putting this information
into my temp. wiki page in hopes of getting it into the dev wiki when that is available.

I am now looking at live intervals in connection with RA and code motion (other possible projects),
and am trying to answer my question of whether live intervals could help shrink wrapping.

Let me know if you think using live interval info would be worth investigating for shrink wrapping.

Let’s worry about edge splitting for a later time. :slight_smile:

Agreed. I am still working through the mechanics to understand how to do it and ramifications.

  1. After the code is cleaned up, we should consider checking it in and try it out as llcbeta. Do you have any idea of its compile time impact?

Thanks,

Evan

I’m working on characterizing the runtime and vm overhead, I don’t yet have a detailed picture.
My plan is to do the cleanups, put together a few larger test cases, go back and run regressions,
then the test suite. With the larger focussed test cases, I will get usable numbers for compile times, and
the test suite will extend the coverage.
Please let me know if there is a simpler or more standard way to tackle this for a new pass.

What about EH and shrink wrapping? Should I disable shrink wrapping in EH contexts?

I have held off looking at maintaining debug info integrity, let me know if I should look at that or if it can wait a bit.

Thanks again,
John

Here is the latest shrink wrapping patch, with fixes for issues identified by Evan.

I am including a few small additions/fixes to include/llvm/ADT/{SparseBitVector,DepthFirstIterator}.h.

Files:
include/llvm/ADT/DepthFirstIterator.h

include/llvm/ADT/SparseBitVector.h
lib/CodeGen/PrologEpilogInserter.cpp

Evan, let me know how it looks when you get a chance.

Thanks much,
John

shrink-wrapping.P3.patch (40.8 KB)

I started to reduce the traversals, then decided to work on edge splitting because I believe it may be needed to finish shrink wrapping.

Hmm. I don’t think edge splitting would be required for correctness, right? There is always a common predecessor / successor. For the first pass, we should not be shooting to optimal solution.

I will return to that work and see if I can reduce the traversals, which for this approach (computing Antic, Avail) will decrease the constant factor in the runtime bound, which is linear in the size of the Machine IR.

  1. Can you explain a bit more about AnticIn, AvailIn, etc.?

I am working on a document, currently hosted at github, which will present the details of the implementation, examples, etc.

I looked at two approaches to determine spill/restore placements:

  1. Try to use live intervals of CSRs that might be available when PEI runs.
    The idea here is that each CSR used in a function will have one or more
    defs which dominate one or more uses. Live intervals might lead me to
    the MBBs in which to place spills/restores.

  2. Use “anticipatibility” (Antic{In,Out} sets) to find the points from which all
    outgoing paths contain defs or uses of a CSR, and use “availability”
    (Avail{In,Out} sets) to find the points such that all incoming paths contain
    defs or uses of a CSR. We place a spill for a CSR at the earliest point
    leading to a sequence of uses (a contiguous set of blocks containing uses),
    so a block B will get a spill for CSR R if R is anticipatable at B and not
    anticipatable at any predecessor of B. If R is used and redefined in a block,
    we have to avoid placing another spill in that block, (it was spilled earlier),
    so in addition to the above condition, R must not be available at B.
    Determining restore placement is the mirror image of spill placement.

I went with approach 2 despite the apparent complexity because the data flow
info is actually straightforward to compute, and I did not have to first synthesize
LiveIntervals (read a ton of code) to get the pass working. I am putting this information
into my temp. wiki page in hopes of getting it into the dev wiki when that is available.

I think it’s the right choice.

I am now looking at live intervals in connection with RA and code motion (other possible projects),
and am trying to answer my question of whether live intervals could help shrink wrapping.

Let me know if you think using live interval info would be worth investigating for shrink wrapping.

The various passes currently do not compute / update live intervals for fixed stack slots so it’s not appropriate for this. That’s why the stack slot coloring pass does not color those slots. It would be a nice enhancement to add (but for a different reason). :slight_smile:

Let’s worry about edge splitting for a later time. :slight_smile:

Agreed. I am still working through the mechanics to understand how to do it and ramifications.

  1. After the code is cleaned up, we should consider checking it in and try it out as llcbeta. Do you have any idea of its compile time impact?

Thanks,

Evan

I’m working on characterizing the runtime and vm overhead, I don’t yet have a detailed picture.
My plan is to do the cleanups, put together a few larger test cases, go back and run regressions,
then the test suite. With the larger focussed test cases, I will get usable numbers for compile times, and
the test suite will extend the coverage.
Please let me know if there is a simpler or more standard way to tackle this for a new pass.

I would just run the test suite once the code is cleaned up. There are enough tests to give us a good idea about the compile time / run time impact. Since shrink wrapping will be guarded by a command line option, you can just run the test suite with ENABLE_LLCBETA. It will report everything we need to know.

What about EH and shrink wrapping? Should I disable shrink wrapping in EH contexts?

I am not sure. If tests using EH fails, we can just disable shrink wrapping for functions with EH.

I have held off looking at maintaining debug info integrity, let me know if I should look at that or if it can wait a bit.

It’s not an immediate problem since -O0 -g means “fast” codegen and shrink wrapping is not run. We can worry about this later.

Thanks,

Evan

Hi John.

I am putting this information
into my temp. wiki page in hopes of getting it into the dev wiki when
that is available.

The dev wiki is up at its temporary name http://google2.osuosl.org/wiki/.

Feel free to dump your stuff on there.

Excellent, thanks very much.

2009/3/17 Evan Cheng <echeng@apple.com>

I started to reduce the traversals, then decided to work on edge splitting because I believe it may be needed to finish shrink wrapping.

Hmm. I don’t think edge splitting would be required for correctness, right? There is always a common predecessor / successor. For the first pass, we should not be shooting to optimal solution.

That’s correct, edge splitting is not really required. I have it as a side project for the purpose of learning more about the design of the Machine layer.

Let me know if you think using live interval info would be worth investigating for shrink wrapping.

The various passes currently do not compute / update live intervals for fixed stack slots so it’s not appropriate for this. That’s why the stack slot coloring pass does not color those slots. It would be a nice enhancement to add (but for a different reason). :slight_smile:

Thanks for the confirmation. I figured out about fixed stack slots not participating in live intervals earlier but I was not certain if it was available via some other means. Since shrink wrapping is more related to PRE (applied to loads/stores), it makes more sense to do the simple analysis anyway.

I’m working on characterizing the runtime and vm overhead, I don’t yet have a detailed picture.
My plan is to do the cleanups, put together a few larger test cases, go back and run regressions,
then the test suite. With the larger focussed test cases, I will get usable numbers for compile times, and
the test suite will extend the coverage.
Please let me know if there is a simpler or more standard way to tackle this for a new pass.

I would just run the test suite once the code is cleaned up. There are enough tests to give us a good idea about the compile time / run time impact. Since shrink wrapping will be guarded by a command line option, you can just run the test suite with ENABLE_LLCBETA. It will report everything we need to know.

Thanks, I was going to ask how ask about llc beta. I will still be adding some stats so I can see when and what it reduces.

What about EH and shrink wrapping? Should I disable shrink wrapping in EH contexts?

I am not sure. If tests using EH fails, we can just disable shrink wrapping for functions with EH.

I have held off looking at maintaining debug info integrity, let me know if I should look at that or if it can wait a bit.

It’s not an immediate problem since -O0 -g means “fast” codegen and shrink wrapping is not run. We can worry about this later.

I will be testing with some small EH examples (which Anton recommended in the beginning) and will take care of disabling it (per-function) if this proves to be necessary.

Thank you very much for the feedback!

John

FWIW, I applied these changes (after wrapping to 80 cols), thanks John!
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20090316/075504.html

-Chris

Hi John,

Unless anyone else has any comments. I recommend you check in the PEI patches so we can start testing it as llcbeta. Chris, should I commit the patch for John or can you grant him commit access (since part of his patch has already been committed)?

Evan

2009/3/23 Evan Cheng <echeng@apple.com>

Hi John,

Unless anyone else has any comments. I recommend you check in the PEI patches so we can start testing it as llcbeta. Chris, should I commit the patch for John or can you grant him commit access (since part of his patch has already been committed)?

Evan

Hi Evan, Chris,

I’ll go ahead and commit it as soon as I have access, if that is the way to proceed.

Thanks,
John