Stack alignment on X86 AVX seems incorrect

> At least for 32-bit x86, reserving another register as an alternative frame
> pointer is very expensive. The above would allow the normal spill logic to
> decide when to keep a reference in a register and when not. It also reuses
> existing functionality as much as possible.

Hi Joerg,

Yes, this was a problem in my implementation also. Empirically, for the chips I work on, reserving the extra frame register was shown to be a win. But, of course, I am sure this win is not universal.

I did receive permission to share my work with the community. However, unless someone finds a creative solution to the extra frame register problem, I doubt my patch would be wanted. If anyone is motivated to work out this issue, I would be happy to help.

My current thinking is that an emergency spill slot could be set aside to hold the original, ABI-conforming frame pointer. Not an ideal solution, but in my situation, where I must cover any code a user throws at me, breaking the ABI and playing with the stack is preferred.

Thanks,
Cameron

Cameron,

Figure 3.3 on page 16 of www.x86-64.org/documentation/abi.pdf is not normative. See footnote 7 on the same page. Figure 3.4 on page 21 confirms that the use of a frame pointer is optional.

So, if one doesn't use ENTER in the prologue and uses RSP to access local variables, RBP may be used as a callee-saved GPR.

> Figure 3.3 on page 16 of www.x86-64.org/documentation/abi.pdf is not
> normative. See footnote 7 on the same page. Figure 3.4 on page 21
> confirms that the use of a frame pointer is optional.
>
> So, if one doesn't use ENTER in the prologue and uses RSP to access local
> variables, RBP may be used as a callee-saved GPR.

I am not sure if I am completely following. The issue that required aligning the frame to 32 bytes is when there are variable sized objects on the stack (e.g. alloca). In that case, the RBP frame pointer is required to access the spill slots. If I’m not mistaken, calculating the address of spill slots off of RSP would be costly in this case.

Are you suggesting that there is a way to base spill slots off of RSP when the stack size is unknown at compile time?

This does bring up an interesting idea though. If we wanted to punt, it would be possible to check for variable-sized objects on the stack and then only issue unaligned moves for 256-bit spills/reloads. Not ideal for performance, but it would work as a stopgap.
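As a rough illustration of that stopgap, the decision could look like the sketch below. Everything in it is hypothetical: `FrameDesc`, `SpillKind`, and `choose_spill_kind` are made-up names for this example and are not LLVM or GCC APIs. It assumes the frame is realigned whenever that is possible, and that the only case where realignment is skipped is a frame with variable-sized objects.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical frame summary; these names do not come from LLVM or GCC. */
typedef struct {
    bool has_var_sized_objects; /* alloca or a variable-length array is present */
    unsigned abi_stack_align;   /* alignment the ABI guarantees, in bytes */
} FrameDesc;

typedef enum {
    SPILL_ALIGNED,  /* e.g. vmovaps: the slot address must be suitably aligned */
    SPILL_UNALIGNED /* e.g. vmovups: any slot address is acceptable */
} SpillKind;

/* Pick the move flavour for a spill slot that wants slot_align bytes. */
static SpillKind choose_spill_kind(const FrameDesc *frame, unsigned slot_align)
{
    /* Alignments are assumed to be non-zero powers of two. */
    assert(slot_align != 0 && (slot_align & (slot_align - 1)) == 0);

    /* The ABI alignment already covers the slot: aligned moves are safe. */
    if (slot_align <= frame->abi_stack_align)
        return SPILL_ALIGNED;

    /* Over-aligned slot: with variable-sized objects the frame is not
       realigned in this sketch, so fall back to unaligned moves. */
    return frame->has_var_sized_objects ? SPILL_UNALIGNED : SPILL_ALIGNED;
}
```

With a 16-byte ABI alignment, a 32-byte slot in a frame that contains an alloca would get `SPILL_UNALIGNED`, while the same slot in a frame without variable-sized objects would keep `SPILL_ALIGNED` and rely on realignment.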

-Cameron

...
> > Figure 3.3 on page 16 of www.x86-64.org/documentation/abi.pdf is not
> > normative. See footnote 7 on the same page. Figure 3.4 on page 21
> > confirms that the use of a frame pointer is optional.
> >
> > So, if one doesn't use ENTER in the prologue and uses RSP to access local
> > variables, RBP may be used as a callee-saved GPR.

> I am not sure if I am completely following. The issue that required
> aligning the frame to 32 bytes is when there are variable sized objects on
> the stack (e.g. alloca). In that case, the RBP frame pointer is required to
> access the spill slots. If I'm not mistaken, calculating the address of
> spill slots off of RSP would be costly in this case.

No, stack realignment needs to happen if there are auto variables on the
stack of types that need a larger alignment than the default. This
currently means AVX vectors for x86-64 and SSE/AVX vectors for x86-32
following the original SysV ABI. In that case %rbp/%ebp is used to
reference the original arguments on the stack and %rsp/%esp is used to
reference the auto variables.

This doesn't work though if dynamic allocas exist, so either stack
variables with larger alignment need to be turned into / remain as
dynamic allocas OR another register is needed to replace %rsp/%esp
in the above.
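To make the addressing problem concrete, here is a small model of the pointer arithmetic only. It is a sketch under stated assumptions, not what GCC or LLVM actually emit: `Frame`, `align_down`, `enter_frame`, and `dynamic_alloca` are hypothetical names, addresses are plain integers, and the stack is assumed to grow downward.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a realigned frame; not taken from GCC or LLVM. */
typedef struct {
    uintptr_t fp; /* frame pointer: fixed offset to the incoming arguments */
    uintptr_t sp; /* stack pointer: fixed offset to the aligned locals */
} Frame;

/* Round sp down to a multiple of align (the stack grows downward). */
static uintptr_t align_down(uintptr_t sp, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    return sp & ~(align - 1);
}

/* Model of a realigning prologue: save the old frame pointer, reserve
   space for the locals, then force the requested alignment. */
static Frame enter_frame(uintptr_t sp_at_entry, uintptr_t word,
                         uintptr_t locals_size, uintptr_t align)
{
    Frame f;
    f.fp = sp_at_entry - word; /* address of the saved frame pointer */
    f.sp = align_down(f.fp - locals_size, align);
    return f;
}

/* Model of alloca(n): sp moves down by an amount known only at run time. */
static void dynamic_alloca(Frame *f, uintptr_t n, uintptr_t align)
{
    f->sp = align_down(f->sp - n, align);
}
```

In this model the distance `fp - sp` after `enter_frame` depends on how the caller happened to leave the stack, so the aligned locals cannot be addressed from `fp` with a constant offset. After `dynamic_alloca` they are no longer at a constant offset from `sp` either, which is why a third register (or a saved copy of the aligned base) is needed.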

> This does bring up an interesting idea though. If we wanted to punt, it
> would be possible to check for variable sized objects on the stack and then
> only issue unaligned moves for 256-bit spills/reloads. Not ideal for
> performance, but it would work as a stopgap.

The problem is worse on x86-32 following the original SysV ABI. In that
case both GCC and LLVM currently just generate broken code if a function
uses both SSE instructions and alloca.
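A function of roughly the following shape exercises that combination. This is only a sketch of a test case, with some loud assumptions: it uses C11 `_Alignas` as a stand-in for an SSE/AVX vector type so it needs no special compiler flags, it uses a variable-length array in place of an explicit alloca call, and the name `overaligned_with_vla` is made up. It makes no claim about which compiler versions handle it correctly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the over-aligned local really is 32-byte aligned and both
   buffers hold the values written to them, 0 otherwise. Requires n > 0. */
int overaligned_with_vla(size_t n)
{
    assert(n > 0);

    /* Local that needs more than the default stack alignment. */
    _Alignas(32) unsigned char vec[32];

    /* Variable-sized object: its size is only known at run time. */
    unsigned char dyn[n];

    memset(vec, 1, sizeof vec);
    memset(dyn, 2, n);

    return ((uintptr_t)vec % 32 == 0) && vec[31] == 1 && dyn[n - 1] == 2;
}
```

Calling it with a few different sizes, ideally from callers that leave the stack at different alignments, and checking that it always returns 1 gives a cheap regression check for the realignment logic.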

Joerg

...

> No, stack realignment needs to happen if there are auto variables on the
> stack of types that need a larger alignment than the default. This
> currently means AVX vectors for x86-64 and SSE/AVX vectors for x86-32
> following the original SysV ABI. In that case %rbp/%ebp is used to
> reference the original arguments on the stack and %rsp/%esp is used to
> reference the auto variables.
>
> This doesn't work though if dynamic allocas exist, so either stack
> variables with larger alignment need to be turned into / remain as
> dynamic allocas OR another register is needed to replace %rsp/%esp
> in the above.

Exactly right.

Cameron,

I was the one not completely following you. I missed the detail about variable-sized variables on the stack.