Aligned vector spills and variably sized stack frames

I've run into a problem that I'm trying to figure out how to address and would welcome ideas and feedback.

Today, the vectorizer will nicely vectorize loops using the widest legal vector type for the target. On a reasonably recent machine, this will often end up using AVX2 registers, which are 32 bytes wide.

If, during register allocation, we decide to spill one of these registers, we use the vmovaps instruction, which requires the memory address it accesses to be 32-byte aligned. So far, so good.

However, the C ABI generally only provides 16 bytes of alignment for the stack on entry to the function. To work around this, the backend will create a variably sized frame, inserting a dynamic amount of padding where required to ensure that a 32-byte aligned spill slot is available.
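To make the "dynamic amount of padding" concrete, here is a minimal sketch of the arithmetic involved. It is not the backend's actual prologue code; the helper names and the two stack pointer values are made up for illustration, and it only assumes the stack pointer is 16-byte aligned and grows toward lower addresses.

```cpp
#include <cassert>
#include <cstdint>

// Round an address down to a multiple of `align` (a power of two).
// The x86 stack grows toward lower addresses, so realignment rounds down.
constexpr std::uint64_t alignDown(std::uint64_t addr, std::uint64_t align) {
  return addr & ~(align - 1);
}

// Bytes of padding a realigning prologue would need to insert so that the
// stack pointer becomes `wanted`-byte aligned (hypothetical helper).
constexpr std::uint64_t realignPadding(std::uint64_t sp, std::uint64_t wanted) {
  return sp - alignDown(sp, wanted);
}

// Two hypothetical 16-byte aligned stack pointers. The padding differs
// between them, which is what makes the resulting frame size variable.
static_assert(realignPadding(0x7ffc1000, 32) == 0, "already 32-byte aligned");
static_assert(realignPadding(0x7ffc1010, 32) == 16, "needs 16 bytes of padding");
```

The point is only that the padding depends on the runtime value of the stack pointer, so the frame size cannot be known statically.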

The problem I have is that my runtime's ABI really doesn't like variably sized frames. In particular, the assumption that stack frames are fixed size - except during prologue and epilogue - is fairly baked in.

I'm weighing a couple of options for addressing this and want to gather feedback on the perceived difficulty of each. If someone has another approach, I'm also very open to that.

Option 1 - Fix my runtime to not expect mostly fixed size frames. This isn't a small change to make, but given it's a strictly internal ABI, I can probably get away with doing it. Given things like shrink-wrapping are coming down the pipe, it might also have secondary benefits. However, this is a relatively risky change to make for what is a fairly rare corner case.

Option 1a - I could change my ABI to use a 32 byte aligned frame. This has many of the same problems as (1).

Option 2 - Don't compile things which need to spill vector registers. This is actually what we do today and has worked out fairly well in practice. This is what I'm hoping to move away from.

Option 3 - Add an option in the x86 backend to not require aligned spill slots for AVX2 registers. In particular, the VMOVUPS instruction can be used to spill vector registers into an 8 or 16 byte aligned spill slot and not require dynamic frame realignment. This seems like it might be useful in other contexts as well, but I can't name any at the moment.

One thing that occurs to me is that many spills are down rare paths. Maybe it would make sense to only do dynamic alignment for hot spills/reloads? We could then simply override the heuristic to always use unaligned spills.

I don't really have a sense for how hard (3) would be to implement. Anyone have an intuition?

Philip

After sending this, I did another search and promptly discovered the existing "no-realign-stack" function attribute which seems to do exactly what I need. Anyone know if this is robust?

From: "Philip Reames via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, August 28, 2015 6:00:50 PM
Subject: [llvm-dev] Aligned vector spills and variably sized stack frames

I don't really have a sense for how hard (3) would be to implement.
Anyone have an intuition?

I suspect that implementing this would not be too difficult. There are essentially two things that need to be changed:

1. Change the code in X86InstrInfo::storeRegToStackSlot / X86InstrInfo::loadRegFromStackSlot to do the right thing for underaligned stack slots (or, in general, under the control of some target feature, option, etc.) [specifically, you need to change the code in those functions to pass false to the isStackAligned parameter of getStoreRegOpcode and getLoadRegOpcode].

2. The alignment necessary for register spills is generically specified in the target's *RegisterInfo.td file (it's the third parameter of the RegisterClass TableGen type). You'd need to specify a way to override that based on some target feature, option, etc. if one does not already exist.
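The two steps above can be sketched as a toy model. This is not LLVM's code: the enum, the SpillPolicy struct and its forceUnalignedSpills flag are hypothetical stand-ins, and only the opcode names and the isStackAligned idea are borrowed from the description above.

```cpp
#include <cassert>

// Hypothetical stand-ins for the aligned and unaligned 256-bit store opcodes.
enum class SpillOpcode { VMOVAPSYmr, VMOVUPSYmr };

// Hypothetical knob; in a real patch this would be a target feature or option.
struct SpillPolicy {
  bool forceUnalignedSpills;
};

// Step 1: pick the unaligned opcode whenever the slot is not known to be
// aligned, or when the policy effectively passes false for isStackAligned.
constexpr SpillOpcode selectStoreOpcode(bool isStackAligned, SpillPolicy policy) {
  return (isStackAligned && !policy.forceUnalignedSpills)
             ? SpillOpcode::VMOVAPSYmr
             : SpillOpcode::VMOVUPSYmr;
}

// Step 2: cap the spill-slot alignment requested for the register class at
// the ABI stack alignment, so no dynamic realignment is needed.
constexpr unsigned spillSlotAlign(unsigned regClassAlign, unsigned stackAlign,
                                  SpillPolicy policy) {
  return (policy.forceUnalignedSpills && regClassAlign > stackAlign)
             ? stackAlign
             : regClassAlign;
}
```

Both pieces have to change together: lowering the slot alignment while still emitting the aligned opcode would fault at run time on a misaligned slot.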

-Hal

From: "Philip Reames via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, August 28, 2015 6:21:00 PM
Subject: Re: [llvm-dev] Aligned vector spills and variably sized stack frames

After sending this, I did another search and promptly discovered the existing "no-realign-stack" function attribute which seems to do exactly what I need. Anyone know if this is robust?

I believe this works correctly, but it is not a targeted fix for the AVX spilling problem ;) (and I can certainly imagine such a targeted feature being generally desirable). Specifically, all overaligned locals will simply fail to be overaligned (and, thus, the resulting code will likely be broken). In your case, I can imagine you can simply promise never to create such things, and you'll be fine.

-Hal

To restate, you're saying that if I had a load or store with alignment greater than the native frame size, that using this option might cause that alignment not to be respected? That would work in practice, but I should probably solve this in a more principled way to avoid future pain. However, given your comments and the existing attribute, implementing something along the lines of my option (3) above shouldn't be too hard. I'll likely post a patch in that direction next week.

Thanks for the guidance.

Philip

From: "Philip Reames" <listmail@philipreames.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, August 28, 2015 7:03:24 PM
Subject: Re: [llvm-dev] Aligned vector spills and variably sized stack frames

To restate, you're saying that if I had a load or store with alignment greater than the native frame size, that using this option might cause that alignment not to be respected?

No, what I'm saying is that if you were to create an alloca instruction with an alignment specified to be greater than the ABI stack alignment, and you use no-realign-stack to disable all stack realignment, then the resulting stack slot may simply not have the requested alignment.
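As a rough illustration of that failure mode, the following toy model places a slot at a fixed offset below a stack pointer that is only guaranteed to be 16-byte aligned. The helper names and addresses are hypothetical; the sketch models the address arithmetic only, not the alloca lowering itself.

```cpp
#include <cassert>
#include <cstdint>

// Address of a stack slot laid out at a fixed offset below the stack pointer.
constexpr std::uint64_t slotAddress(std::uint64_t sp, std::uint64_t offsetFromSp) {
  return sp - offsetFromSp;
}

// True if `addr` is a multiple of `align` (a power of two).
constexpr bool isAligned(std::uint64_t addr, std::uint64_t align) {
  return (addr & (align - 1)) == 0;
}

// The offset (64) is a multiple of 32, so the layout looks fine on paper.
// Whether the slot is really 32-byte aligned depends on the incoming
// stack pointer, which the ABI only aligns to 16 bytes.
static_assert(isAligned(slotAddress(0x7ffc1000, 64), 32), "lucky case");
static_assert(!isAligned(slotAddress(0x7ffc1010, 64), 32), "unlucky case");
static_assert(isAligned(slotAddress(0x7ffc1010, 64), 16), "ABI alignment still holds");
```

Without realignment, the slot gets the ABI alignment and nothing more, so any code that relies on the requested 32-byte alignment may break.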

-Hal

> If someone has another approach, I'm also very open to that.

I recently saw another compiler use a "cute trick" for this sort of thing: allocate a fixed amount of space on the stack by including the worst-case padding, and then dynamically set the frame pointer to an aligned location within that. I wouldn't go so far as to say that's a *better* approach (it seems gimmicky/fishy and could open other problems by surprising your runtime/tools in other ways), but it's certainly *another* one :). The only actual benefit that comes to mind is that it would cover other sources of dynamic alignment than RA spill slots, if that's something you need to worry about.

-Joseph

If one uses that trick, one should combine all the items needing the large alignment into one allocation. Otherwise, one will be allocating extra space all over the place along with needing a pointer variable for every aligned object.

Absolutely. The idea would be to do it for the whole stack frame: rather than having a frame pointer at a fixed offset from the parent frame and a dynamically-computed stack pointer, you'd have a stack pointer at a fixed offset from the parent frame and a dynamically-computed frame pointer (and so would have to swap which one you use to access locals and which one you use to access incoming parameters).
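A minimal sketch of that whole-frame variant follows, assuming a 16-byte ABI alignment, a 32-byte requirement, and a locals size that is a multiple of 16. The Frame struct and function names are hypothetical and this is not any compiler's actual frame lowering.

```cpp
#include <cassert>
#include <cstdint>

struct Frame {
  std::uint64_t sp;  // fixed offset from the incoming stack pointer
  std::uint64_t fp;  // dynamically computed, aligned base for locals
};

// Round an address up to a multiple of `align` (a power of two).
constexpr std::uint64_t alignUp(std::uint64_t addr, std::uint64_t align) {
  return (addr + align - 1) & ~(align - 1);
}

// The frame always reserves the worst-case padding (wanted - abiAlign),
// so its size is a compile-time constant.
constexpr std::uint64_t fixedFrameSize(std::uint64_t localsSize,
                                       std::uint64_t wanted,
                                       std::uint64_t abiAlign) {
  return localsSize + (wanted - abiAlign);
}

// sp moves by the fixed frame size; fp is rounded up inside the reserved
// space, so fp + localsSize never exceeds the incoming stack pointer.
constexpr Frame setUpFrame(std::uint64_t entrySp, std::uint64_t localsSize,
                           std::uint64_t wanted, std::uint64_t abiAlign) {
  return Frame{
      entrySp - fixedFrameSize(localsSize, wanted, abiAlign),
      alignUp(entrySp - fixedFrameSize(localsSize, wanted, abiAlign), wanted)};
}
```

In both of the cases exercised here the frame is 80 bytes for 64 bytes of locals. Only the position of fp inside it changes, which is why locals would be addressed off fp and incoming parameters off sp.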

I came here to suggest this approach also. :)

Right now the X86 backend uses a stack realignment prologue that is designed to fix up the incoming and outgoing alignment to some number. We only need this prologue when the user is telling us that the incoming alignment is too low and must be fixed up (i.e. -mstackrealign).