RFC: Setting MachineInstr flags through storeRegToStackSlot

## Problem description

One of the responsibilities of a target's implementation of
TargetFrameLowering::emitPrologue is to set the frame pointer (if needed).
Typically, the frame pointer will be stored to the stack just like the other
callee-saved registers, and emitPrologue must insert the instruction to change
its value after it was stored to the stack. Mips does this by looking at the
number of callee-saved registers spilled for this function and skipping over
that many instructions. Other backends such as AArch64 or X86 will skip over
all instruction marked with the MachineInstr::FrameSetup flag[1].

From my point of view, skipping over all FrameSetup instructions sounds like

the more robust way to go about things. I haven't fully traced through where
the flag is used in LLVM, but it does seem that DwarfDebug::findPrologueEndLoc
will return the wrong result if the flag isn't consistently set on frame setup
code. The problem is that unless you override
TargetFrameLowering::spillCalleeSavedRegisters, then PrologEpilogInserter will
just spill callee saved registers through a series of calls to
storeRegToStackSlot. This is fine, except storeRegToStackSlot won't set the
FrameSetup flag.

[1]: Actually X86 will only skip over PUSH instructions with the FrameSetup
flag. Is this extra condition necessary?

## Potential solutions

1) Keep the status quo, but modify PrologEpilogInserter so it will mark the
last inserted instruction with FrameSetup over calling storeRegToStackSlot.
This is exactly what X86 does in its spillCalleeSavedRegisters implementation.
This does make the assumption that storeRegtoStackSlot only inserts a single
instruction. From what I can see, that is true for all in-tree backends.
Arguably, backends like Mips could be improved by modifying their
spillCalleeSavedRegisters to set the FrameSetup flag as well.

2) Like the above, but modify storeRegToStackSlot to return the number of
MachineInstr inserted and use that value when marking instructions as
FrameSetup. This is more invasive, as it will affect all in tree and
out-of-tree backends.

3) Add a MachineInstr::MIFlag parameter to storeRegToStackSlot. The
storeRegToStackSlot implementation is responsible for applying this flag to
all inserted instructions. This could be defaulted to MachineInstr::NoFlags,
to minimise changes to function callers, although that defaulted value would
have to be replicated in all FooTgtInstrInfo.h headers.

For either 2) or 3), the changes are simple, it's just they will affect all
backends. I'm very happy to implement the patch, but thought it was worth
asking feedback. I'm CCing in Tim+Craig+Simon as AArch64/X86/Mips code owners,
as well as Eric as debug info code owner.

Any feedback is very welcome.

Thanks,

Alex

I should meant to say, the exact same issue exists for the
MachineInstr::FrameDestroy flag, emitEpilogue, and
loadRegFromStackSlot. The same resolution should be used for this case
as well.

Option 2) that I mentioned isn't quite necessary, callers can work out
the number of MachineInstrs emitted by looking at the change in the
size of the MachineBasicBlock. This does have the disadvantage of
forcing all backends using {storeRegTo,loadRegFrom}StackSlot in
FooTgtFrameLowering to replicate the necessary loops to mark the
inserted instructions.

Best,

Alex

Hi Alex,

I believe solution #3 is the one that makes sense the more sense.

Cheers,
-Quentin

Alex,

Solution 3 does seem like the simplest and most direct solution.

Thanks,
Simon

That's my preference too. I've posted an illustrative patch at
https://reviews.llvm.org/D30115. This would need a little cleanup
before merging, and should probably go hand in hand with matching
changes to loadFromFromStackSlot.

Best,

Alex

Can someone familiar with debug info comment on whether it matters to have the FrameSetup flag on the callee save spills? We could have a smart spill or shrink wrapping algorithm that delays the callee saves to a later point in the program while executing non-prologue code first.

Is this maybe just meant to indicate the point at which the stack and possible frame/base pointers are setup? Saving the callee saves should not necessary be part of that, it probably is on X86 where the pushs/pops are part of getting the final value of the stack pointer in a function without frame pointer, but for the case where you use storeRegToStackSlot() I'm not sure we even need to mark them as FrameSetup...

- Matthias

I'm not that familiar with debug info, but FWIW, Hexagon doesn't set that flag (or FrameDestroy) on anything and the DWARF info works fine (for the purposes of exception handling).

-Krzysztof

You make a good point, better understanding the importance (or lack of
importance) of setting the FrameSetup and FrameDestroy flags for debug
would be useful. The flags are useful for finding the end of the
callee save spills in emitPrologue and the beginning of the callee
save restored in emitEpilogue, but if it's not otherwise useful to
have them set that problem could be solved in other ways - for
instance passing these locations as arguments to
emitPrologue/emitEpilogue.

You make a good point about shrink wrapping -
DwarfDebug::findPrologueEndLoc seems to be one of the main places the
FrameSetup flag is checked. If shrink wrapping is applied, the frame
setup may not take place in the entry BB. The function could more
accurately described as findFirstNonPrologueLoc. For the
non-shrinkwrap case, is there any meaningful drawback for the
generated debug info if findPrologueEndLoc returns where callee saved
registers are spilled (like on MIPS) rather than the first
non-FrameSetup instruction after the spills (AArch64, X86)?

Best,

Alex

From: mbraun@apple.com [mailto:mbraun@apple.com]
Sent: Friday, February 17, 2017 3:15 PM
To: Alex Bradbury
Cc: llvm-dev; Adrian Prantl; Eric Christopher; Robinson, Paul
Subject: Re: [llvm-dev] RFC: Setting MachineInstr flags through
storeRegToStackSlot

Can someone familiar with debug info comment on whether it matters to have
the FrameSetup flag on the callee save spills? We could have a smart spill
or shrink wrapping algorithm that delays the callee saves to a later point
in the program while executing non-prologue code first.

Is this maybe just meant to indicate the point at which the stack and
possible frame/base pointers are setup? Saving the callee saves should not
necessary be part of that, it probably is on X86 where the pushs/pops are
part of getting the final value of the stack pointer in a function without
frame pointer, but for the case where you use storeRegToStackSlot() I'm
not sure we even need to mark them as FrameSetup...

- Matthias

In DWARF, the idea is that if the debugger is doing a "step in" operation
for a call to the function, where is the most reasonable place to stop
execution, that is still "before" the first statement? Debug info for
parameters and local variables commonly points to stack frame slots, so
generally it's not helpful (for the user) to stop before the stack/frame
pointer is set up the way the debug info expects.

Stashing callee-saved registers is not necessarily part of this, although
if you're using the stack pointer as a frame pointer and doing pushes to
save registers, you need to be done fiddling with SP before you can say
the prologue is ended.
--paulr

Thanks. I'm starting to think the better fix might be to add an
iterator argument to emitPrologue and emitEpilogue to indicate
after/before the callee-saves. Does anyone have particular views on
this one way or another?

Best,

Alex

Scheduling can interfere with this as you might schedule or move something before saves on occasion.

-eric

The process of inserting the callee-save spills and prolog+epilog is
all handled from PrologEpilogInserter (see PEI::runOnMachineFunction).
It seems to me the only chance for things to go wrong after
SpillCalleeSavedRegisters is that the target-specific
processFunctionBeforeFrameFinalized moves instructions around, thus
invalidating the positions of the callee saves/restores.

Best,

Alex

In DWARF, the idea is that if the debugger is doing a "step in" operation
for a call to the function, where is the most reasonable place to stop
execution, that is still "before" the first statement? Debug info for
parameters and local variables commonly points to stack frame slots, so
generally it's not helpful (for the user) to stop before the stack/frame
pointer is set up the way the debug info expects.

Stashing callee-saved registers is not necessarily part of this, although
if you're using the stack pointer as a frame pointer and doing pushes to
save registers, you need to be done fiddling with SP before you can say
the prologue is ended.
--paulr

Thanks. I'm starting to think the better fix might be to add an
iterator argument to emitPrologue and emitEpilogue to indicate
after/before the callee-saves. Does anyone have particular views on
this one way or another?

I'd like to return to this, as I really want to be able to reuse the
standard logic in PrologEpilogInserter without having to make
assumptions that may break in the future (e.g. that every callee-saved
register spill takes a single instruction).

From the discussion in this thread and the strawman patch

(https://reviews.llvm.org/D30115), I think that adding FrameSetup or
FrameDestroy flags to all CSR spills/restores is probably not the
right way to go (although some backends do choose to do this).

I propose:
* Modifying TargetFrameLowering::emitPrologue to take a
MachineBasicBlock::iterator 'AfterCSRSaves', pointing to the
instruction after any inserted callee-save register spills. The
insertCSRSaves helper can easily be modified to return this iterator
* Modifying TargetFrameLowering::emitEpilogue to take a
MachineBasicBlock::iterator 'FirstCSRRestoreOrTerminator'.
insertCSRRestores can be modified to return this iterator, though a
little care needs to be taken for the case where a BB only has a
single instruction.

In both cases, these iterator arguments are intended to be exactly
what you need in emitPrologue/emitEpilogue when determining where to
insert the frame setup / destruction code. I'm bringing it to the
mailing list again as the above solution still feels inelegant somehow
- though does seem an improvement over the status quo.

Does anybody have any thoughts on this, or alternative approaches they
would like to propose?

Thanks,

Alex