Stack alignment on X86 AVX seems incorrect

Hi Elena,

You’re correct. LLVM does not align the stack to 32-bytes for AVX and unaligned moves should be used for YMM spills.

I wrote some code to align the stack to 32-bytes when AVX spills are present; it does break the x86-64 ABI though. If upstream would be interested in this code, I can arrange with my employer to send a patch to the mailing list.

-Cameron

Cameron,

Aligning the stack to 32 bytes when there are auto AVX vector variables present shouldn't necessarily break the x86-64 ABI, as long as smaller auto variables remain properly aligned. A similar approach was taken for i386 in GCC in order to support SSE vectors.

Perhaps you could elaborate where the ABI was violated when your patch is applied.

HTH

Really? There's supposed to be code to realign the stack when
necessary... it's possible that code is broken, though.

-Eli

Aligning the stack to 32 bytes when there are auto AVX vector variables present shouldn’t necessarily break the x86-64 ABI, as long as smaller auto variables remain properly aligned. A similar approach was taken for i386 in GCC in order to support SSE vectors.

Perhaps you could elaborate where the ABI was violated when your patch is applied.

Sorry, I confused myself. This was worked out about a year ago, so it’s not in my cache.

You’re right. In my last email I wrote “align the stack”, when I should have written “align the frame when variable sized objects are in play”. Take main(…) for example, with a few alloca’s. If one would like to spill AVX regs with aligned moves, one must align the frame to 32 bytes to ensure that the spill slots are aligned correctly, since spill slots are based off of the frame pointer.

The x86-64 ABI lays out the stack frame as…


16(%rbp) mem arg[0]
8(%rbp) return address
0(%rbp) previous %rbp
-8(%rbp) stack

My patch breaks the ABI since the ABI requires the return address to be found at 8(%rbp) and mem arg[0] at 16(%rbp). Unfortunately, to align the frame at runtime, it’s sometimes required to insert padding in between the two.

I’ll ask for my company’s permission to share my implementation. Until then, I’ll have to bite my tongue.

-Cameron

Eh, no. The X86-64 ABI doesn't require the use of a frame pointer, so it
obviously can't require anything relative to %rbp.

Please note that stack realignment is implement unless code requires
dynamic allocas. The conditionals seems to be wrong though as mentioned
in this thread.

Joerg

This topic is starting to come back to me now. The reason the GCC solution above did not work for us is that we do not build all of the libraries used with our compiler. For example, some are proprietary compiled object files and some are GCC compiled object files from other sources. If our object files called another library, and in turn that library called a function in our object code, it’s not possible to ensure that the frame of the current function is still aligned to 32 bytes. That was the determining factor in my implementation.

That is, unless you know something that I don’t. I’m pretty new to compiler development. :slight_smile:

-Cameron

You can only ever guarantee that code is aligned to the ABI alignment.

-eric

On entry to the function, right. The function can do dynamic stack realignment, however, to whatever it wants. ARM, especially on Darwin, does that quite a lot.

-Jim

Even if you explicitly specify –stack-alignment=16 the aligned movs are still generated.

It is not an issue related to ABI.

See my original mail:

./llc -mattr=+avx -stack-alignment=16 < basic.ll | grep movaps | grep ymm | grep rbp

vmovaps -176(%rbp), %ymm14

vmovaps -144(%rbp), %ymm11

vmovaps -240(%rbp), %ymm13

- Elena

Even if you explicitly specify –stack-alignment=16 the aligned movs are still generated.

It is not an issue related to ABI.

Right, your issue is triggered by the code I sent out earlier:

The faulty code can be found in function X86InstrInfo::storeRegToStackSlot(…) from
/lib/Target/X86/X86InstrInfo.cpp.

bool isAligned = (RI.getStackAlignment() >= 16) || RI.canRealignStack(MF);

In some cases, the stack is assumed to be aligned if it’s on a 16 byte or greater boundary. Your desired alignment is 32 bytes, so aligned 256b moves are selected which is not correct. At runtime, your stack slots could still be aligned on a 16 byte boundary. You’ll have to either:

  1. Always use unaligned moves;
  2. Update the code above to handle 32 byte alignment (Is this even possible at compile time? I wouldn’t think so.);
  3. Align the frame and stack to 32 bytes, so that AVX spill slots are always on 32 byte boundaries (This is what I’m proposing.);

-Cameron

Hi Elena,

I was looking at this again today. What about the following approach:

(1) Change AllocaInst to compute the isStaticAlloca once and remember
    it.

(2) Check all functions for
  (a) static allocations with an alignment larger than the default stack
      alignemnt
  (b) dynamic alloca

(3) If (a) is present and not (b), use the frame pointer to address
    arguments and the stack pointer to address local variables.

    If (b) is present and not (a), use the frame pointer to address
    arguments and local variables. Realign the stack pointer to the
    largest alignment needed for dynamic alloca.

    If (a) and (b) are present, adjust the isStatic attribute of all
    allocas with alignment larger than the default stack alignment.
    Deal with the rest like the case before.

At least for 32bit x86 reserving another register as alternative frame
pointer is very heavy. The above would allow normal spill logic to
decide when to keep a reference in register and when not. It also reuses
existing functionality as much as possible.

Joerg

Joerg,

At least for 32bit x86 reserving another register as alternative frame
pointer is very heavy. The above would allow normal spill logic to
decide when to keep a reference in register and when not. It also reuses
existing functionality as much as possible.

It does not seem to be enough. Even is there are *no* allocas in the
function the stack realignment might still be necessary, for example
due to spill of vector register.
So, we'll need to decide very late whether we'll need realignment or not.

Joerg,

At least for 32bit x86 reserving another register as alternative frame
pointer is very heavy. The above would allow normal spill logic to
decide when to keep a reference in register and when not. It also reuses
existing functionality as much as possible.

It does not seem to be enough. Even is there are *no* allocas in the
function the stack realignment might still be necessary, for example
due to spill of vector register.
So, we'll need to decide very late whether we'll need realignment or not.

Absolutely right. Which means it will need to be a heuristic, because regalloc needs to know early whether to reserve the register or not. There's already some of this sort of thing for whether the frame pointer is reserved or not. The ARM backend already does a fair bit of this (see hasBasePointer() and friends) and may be useful as an example of what's involved.

-Jim