Default stack alignment for x86 changed

Hi Everyone,

I have noticed that on certain machines there is a significant execution
speed degradation for a 32-bit application making intense use of double
precision floating-point. The application was compiled with no optimizations
with a customized build of Clang on Windows 7 64-bit.

I have tracked down the slowdown to a stack alignment problem. It looks like
certain machines/processors manifest different performance hits on unaligned
memory access. LLVM change 147888 a while back modified the alignment from 8
to 4 without giving more concrete reasons for it. Note that having many
computations with 'doubles' can be a rather common use-case. I was hoping
that someone could give more information on this.

I found that it is possible to override the default alignment from the
command line but the question is if you foresee any bad effects of this,
besides slightly increased stack usage?
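
A minimal sketch of how the effect of such an override could be observed. It assumes the override meant here is something like Clang's -mstack-alignment option, which the message does not name; the helper name stack_double_misalignment is hypothetical. The function only reports where one local double happened to land, so it shows a symptom and is not a benchmark.

```c
#include <stdint.h>

/* Returns the address of a local double modulo 8.
   0 means the slot happened to be 8-byte aligned,
   4 means it was only 4-byte aligned. */
unsigned stack_double_misalignment(void) {
    volatile double d = 1.0;   /* volatile keeps the variable in memory */
    return (unsigned)((uintptr_t)&d % 8u);
}
```

Calling this from several call depths in a 32-bit build, with and without the alignment override, would show whether doubles on the stack are landing on 4-byte-only boundaries.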

Thanks in advance,

The SysV ABI says the stack alignment is 32 bits, and double itself has only 32-bit alignment.
If there is any item on the stack that requires more, it will be
realigned automatically, but that comes with some cost, too, like an
additional register used.


I understand, so the change was made with Unix-based systems in mind.
Unfortunately the win32 x86 ABI seems to require doubles to be 64-bit
aligned. Could we perhaps keep the 8-byte alignment only for win32 targets?


The win32 stack only guarantees 4 byte alignment, so far as I know.

While MSVC aligns doubles to 8 byte boundaries when doing record layout, it
will not realign the stack to guarantee 8 byte alignment of double stack

Although __alignof(double) on Windows returns 8, the actual minimum stack alignment is still 4. Here is a source example illustrating this:

#include <stdlib.h>

int a = __alignof(double);

extern void crud1(int i, double *p);

void crud(void) {
  double dummy;
  crud1(0, &dummy);
}

Assembly code produced from VS 2012, compiling with cl -Fa -c -O2 crud.c
_a DD 08H
PUBLIC _crud
; Function compile flags: /Ogtpy
; COMDAT _crud
_dummy$ = -8
_crud PROC
; File d:\users\kbsmith1\tc_tmp1\crud.c
; Line 7
        sub esp, 8
; Line 9
        lea eax, DWORD PTR _dummy$[esp+8]
        push eax
        push 0
        call _crud1
; Line 10
        add esp, 16
        ret 0

You can see from the initialization value of a that __alignof(double) produced 8. You can also see that there is no code at the beginning of function crud to
align the stack. So, if the stack comes in on a 4 byte boundary, it will remain on a 4 byte boundary, and since the function subtracts 8 from esp, if it comes in on an 8 byte boundary
it will stay on an 8 byte boundary. Now consider the call to crud1. The code pushes two 4 byte parameters, and then the call pushes the 4 byte return address, 12 bytes in total. So, if the stack
comes in 8 byte aligned, at the entry to crud1 the stack is only 4 byte aligned.

For this reason, on Windows, although __alignof(double) is 8, it does not follow that every double * value is 8 byte aligned.
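
The stack arithmetic in the listing above can be replayed as a small sketch. The function name esp_at_crud1_entry is hypothetical; it simply applies the same adjustments the generated assembly makes, assuming 4-byte pushes as on IA32.

```c
#include <stdint.h>

/* Replays the esp adjustments made by _crud before control reaches _crud1. */
uint32_t esp_at_crud1_entry(uint32_t esp_at_crud_entry) {
    uint32_t esp = esp_at_crud_entry;
    esp -= 8;   /* sub esp, 8   : room for dummy          */
    esp -= 4;   /* push eax     : address of dummy        */
    esp -= 4;   /* push 0       : first argument          */
    esp -= 4;   /* call _crud1  : pushes return address   */
    return esp;
}
```

With an entry value of 0x1000 (8 byte aligned) the result modulo 8 is 4, and with 0x1004 it is 0, so the alignment modulo 8 flips at every such call.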

Also, for IA32 on linux, 4 byte minimum stack alignment used to be specified by the Sys V ABI, which is pretty much the only one you can find references to on the web. However, for quite a number of years, gcc's default on linux has been to assure 16 byte stack alignment at function entry, so that functions that use SSE/SSE2 instructions (and might possibly need to spill) do not have to perform dynamic stack alignment. In gcc this is controlled by the -mpreferred-stack-boundary=num option. The gcc documentation says the default for this option is 4, implying 16 byte (2^4) stack alignment.
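
As a small illustration of how the option value maps to bytes: num is an exponent, so the boundary is 2 raised to num. The helper name stack_boundary_bytes is hypothetical and is not part of gcc.

```c
/* Converts a -mpreferred-stack-boundary=num value to a byte alignment (2^num). */
unsigned stack_boundary_bytes(unsigned num) {
    return 1u << num;
}
```

So num = 4 gives 16 bytes, num = 3 gives 8 bytes, and num = 2 gives the 4 byte alignment of the older Sys V ABI.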

Kevin Smith

To be clear: double does not REQUIRE 8 byte alignment, but
(reasonably modern, like "Pentium onwards", so ca 1994-5 ish) x86
processors "prefer" 8-byte alignment for "double" values, since
they can then be read in ONE cycle on a 64-bit bus.

And of course, SSE instructions that aren't specifically designed for
unaligned loads will require a 16-byte alignment. Or does SSE code
automatically modify the alignment criteria for the function?

Further, shouldn't the stack be aligned to "LargestAlignment" or
whatever it is called? Otherwise, any structure alignment will surely
be "lost"?

Clang is really no different than MSVC here (I just double checked). For SSE you always had to specify the alignment required because it was never guaranteed by the compiler (especially when you get into mandatory 16-byte alignment). It’s interesting that it’s such a performance issue though; unless you’re really memory constrained, it seems the size/speed trade-off is clearly in favour of 8 byte alignment even though it’s not technically necessary.
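
A minimal sketch of specifying the alignment explicitly, as described above. It assumes a compiler with C11 _Alignas support (older MSVC versions would use __declspec(align(16)) instead); the function name aligned_local_misalignment is hypothetical. When the incoming stack is not sufficiently aligned, the compiler is expected to realign it in the prolog to honour the request.

```c
#include <stdint.h>

/* Requests 16-byte alignment for a local buffer and reports its
   address modulo 16, which should be 0 if the request is honoured. */
unsigned aligned_local_misalignment(void) {
    _Alignas(16) volatile float lanes[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return (unsigned)((uintptr_t)&lanes[0] % 16u);
}
```

The cost of the guarantee is the extra prolog/epilog work and, on IA32, possibly an extra register to address the realigned frame.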

> It’s interesting that it’s such a performance issue though

I don’t think it really is much of a performance issue, except perhaps on Quark. All recent processors for IA32 make unaligned accesses effectively the same performance as aligned accesses unless they cross a cache-line boundary. And since alignment within structs is made 8 bytes, provided the object is allocated dynamically, the memory allocators will often return memory that is “well aligned” as well.
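
The cache-line condition can be sketched as a simple check. The 64-byte line size used in the usage note is an assumption (common on recent x86 parts, not stated in this thread), and crosses_cache_line is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes starting at `addr` spans two
   cache lines of `line_size` bytes, 0 otherwise. Assumes size > 0. */
int crosses_cache_line(uintptr_t addr, size_t size, size_t line_size) {
    return (addr / line_size) != ((addr + size - 1) / line_size);
}
```

For an 8 byte double that is only 4 byte aligned, with 64-byte lines, this is true only when the address modulo 64 is 60, i.e. for one of the sixteen possible 4 byte aligned positions in a line.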

So, that leaves potential penalties for things allocated on the stack. Again, if the compiler thinks there is a performance reason for doing so, it can use extra instructions in the prolog and epilog of a routine to ensure a higher than minimum stack alignment.

But this discussion was about the ABIs, and what is guaranteed. The ABI for IA32 Windows guarantees only 4 byte stack alignment upon entry to a function. The ABI (that gcc is assuming) for IA32 linux guarantees 16 byte alignment, but that can be controlled by the -mpreferred-stack-boundary option. For example, the linux kernel is built with gcc using the 4 byte alignment guarantee version of that option.


Thank you all for the useful replies. It then looks like the default
alignment of 4 is good for win32 as well.
The conclusion I have is that the ABI does not 'require' alignment of 8
bytes for doubles, but in order to ensure that code runs with no performance
loss on all processor types, even older Pentiums, it would be 'recommended'
to align them on multiples of 8 bytes.