x86 Frame Pointer with AVX

Hey guys,

I found a performance regression in the X86 backend related to PR10884.

In trunk, the frame pointer is always set up when an AVX register is used in a function. This is done in case 32-byte spill code is later introduced into the function and hence dynamic stack realignment is needed. Needless to say, it’s a big hammer. This regression seems particularly painful in small-to-medium-sized routines that are called frequently in some codes.
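
For readers less familiar with why realignment and the frame pointer go together, here is a minimal sketch of the arithmetic. It is not LLVM code; `alignDown` and `realignmentAdjustment` are hypothetical helper names, and the stack pointer values are made up. Realigning rounds the stack pointer down to a 32-byte boundary, and the amount dropped depends on the incoming value, so it is not known at compile time.

```cpp
#include <cstdint>

// Round a stack pointer value down to a power-of-two alignment. This is the
// computation an instruction such as `and rsp, -32` performs.
// (Hypothetical helper, for illustration only.)
constexpr std::uint64_t alignDown(std::uint64_t sp, std::uint64_t align) {
    return sp & ~(align - 1);
}

// Number of bytes the stack pointer moves when it is realigned.
constexpr std::uint64_t realignmentAdjustment(std::uint64_t sp,
                                              std::uint64_t align) {
    return sp - alignDown(sp, align);
}

// Two made-up stack pointer values, both 16-byte aligned. Only the second is
// already 32-byte aligned, so the adjustment differs between them.
static_assert(alignDown(0x7ffc0010ULL, 32) == 0x7ffc0000ULL, "rounded down");
static_assert(alignDown(0x7ffc0020ULL, 32) == 0x7ffc0020ULL, "unchanged");
static_assert(realignmentAdjustment(0x7ffc0010ULL, 32) == 16, "moved by 16");
static_assert(realignmentAdjustment(0x7ffc0020ULL, 32) == 0, "moved by 0");
```

Because the adjustment varies from call to call, anything that lives above the realigned area (incoming stack arguments, the return address) can no longer be addressed at a fixed offset from the stack pointer. A second register holding the pre-alignment value, normally the frame pointer, is what makes those locations reachable again.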

Is this issue already known? Is there a plan to fix this regression? If not, does anyone have a suggestion on the best way to remedy this issue?

I have attached the IR and C code for a trivial test which exhibits the problem. The IR was produced by clang-421.0.60.

Thanks,

Cameron

test.ll (1.45 KB)

test.c (136 Bytes)

You'd need to change the default stack alignment of the platform to
deal with it effectively.

-eric

Hey Eric,

Thanks for replying so quickly. Would you elaborate on this further?

It seems costly to change the default stack alignment on the platform, since that would require recompiling all of the system and user libraries to also adhere to 32-byte stack alignment. Depending on an alignment not specified by the ABI would also limit our compiler’s interoperability with other compilers installed on the system.

I suppose that the stack could be aligned dynamically at main(...) and other visible entry points, but that too seems costly compared to the current M.O.

Maybe I do not fully understand all the issues involved, but I suppose I should be able to dynamically align the stack only when AVX registers are spilled in a function, right? That seems reasonable given my limited knowledge. Do you have any intuition about this? It is possible that the prologue/epilogue emitters run prior to the spilling decisions; I am not sure of the ordering here.

Also, and this might be asking a lot, but do you have any insight into why this behaviour changed sometime around the LLVM 3.0 release? I have not been able to find much history.

Thanks again,
Cameron

Eric Christopher <echristo@gmail.com> writes:

You'd need to change the default stack alignment of the platform to
deal with it effectively.

That's not possible since such code will have to interface with
externally-compiled libraries which won't have the same alignment
assumptions.

                            -David

I didn't say it would be easy.

-eric

It should only be happening if there is either a variable stored or
saved on the stack. Functions without AVX shouldn't cause a 32-byte
aligned stack or a realignment unless they are using callee-saved
registers that are clobbered, etc. Basically, any stack traffic with an
AVX register should cause a dynamic realignment, and otherwise we
should use the default ABI alignment. It's arguably a bug if anything
else is happening.

-eric

Also, and this might be asking a lot, but do you have any insight into why
this behaviour changed sometime around the LLVM 3.0 release? I have not been
able to find much history.

Oh, one more thing, it would probably have changed around then because
we started getting AVX support then.

-eric

Yes, I believe that this is happening if any AVX register is used in a function; an AVX variable does not necessarily need to be placed on the stack.

Maybe I am misunderstanding this piece of code though…

    // Be over-conservative: scan over all vreg defs and find whether vector
    // registers are used. If yes, there is a possibility that vector register
    // will be spilled and thus require dynamic stack realignment.
    for (unsigned i = 0, e = RI.getNumVirtRegs(); i != e; ++i) {
      unsigned Reg = TargetRegisterInfo::index2VirtReg(i);
      if (RI.getRegClass(Reg)->getAlignment() > StackAlignment) {
        FuncInfo->setForceFramePointer(true); // <= Forces Frame Pointer for any AVX reg use!!!
        return true;
      }
    }
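
To make the difference concrete, here is a small standalone model of the two policies. It is only a sketch: `VirtReg`, `forcesFramePointerConservative`, and `needsRealignmentForSpills` are hypothetical names and not part of LLVM. The first function mirrors the loop above, and the second is the spill-based behaviour I would expect instead.

```cpp
#include <vector>

// Simplified stand-in for a virtual register. NOT an LLVM data structure.
struct VirtReg {
    unsigned ClassAlign; // spill alignment of its register class, in bytes
    bool Spilled;        // whether the allocator actually spilled it
};

// Mirrors the quoted loop: the mere presence of a register whose class
// alignment exceeds the stack alignment forces a frame pointer.
bool forcesFramePointerConservative(const std::vector<VirtReg> &Regs,
                                    unsigned StackAlign) {
    for (const VirtReg &R : Regs)
        if (R.ClassAlign > StackAlign)
            return true;
    return false;
}

// Hypothetical alternative: realign only when an over-aligned register
// really ends up on the stack.
bool needsRealignmentForSpills(const std::vector<VirtReg> &Regs,
                               unsigned StackAlign) {
    for (const VirtReg &R : Regs)
        if (R.Spilled && R.ClassAlign > StackAlign)
            return true;
    return false;
}
```

With a 16-byte stack alignment, a function holding one unspilled 32-byte register makes the first check return true and the second return false, which is exactly the case in my test. The sketch assumes spill information is available when the decision is made; if the decision has to be taken before register allocation, that ordering problem still needs solving.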

Just to be pedantic, at one time this did work properly for AVX without adding the unnecessary frame pointer. It is a proper regression.

-Cameron

Yeah, I was remembering that code too. I wasn't sure if we'd fixed it
or not. It's definitely good for a bug report.

-eric

Filed Bug 14159. Thanks again for your help, Eric.