GCC/LLVM frame pointer incompatibility on ARM

Hi all,

As has been mentioned several times (*), LLVM and GCC set up the frame pointer to point to different stack slots on ARM. GCC's fp points to the stack slot holding lr, while LLVM's fp points at the next slot.

Fp incompatibility complicates low-level system code e.g. stack unwinders because it is impossible to robustly determine location of caller's fp.

Is this incompatibility intentional/desired, or could we somehow unify GCC and LLVM in this regard?

(*) Links to older discussions:
* http://comments.gmane.org/gmane.comp.compilers.llvm.devel/69514
* https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61771

I don't understand this argument. The ARM EH / DWARF annotation is
supported by LLVM and encodes exactly the data required for robustly
unwinding the stack.

Joerg

Fp incompatibility complicates low-level system code e.g. stack
unwinders because it is impossible to robustly determine location of
caller's fp.

I don't understand this argument. The ARM EH / DWARF annotation is
supported by LLVM and encodes exactly the data required for robustly
unwinding the stack.

Not fast enough for us.

I don't understand this argument. The ARM EH / DWARF annotation is
supported by LLVM and encodes exactly the data required for robustly
unwinding the stack.

Plus, relying on specific compiler's output cannot ever be robust,
even if all known compilers do the same thing, one unknown will stand
different. Following EH directives is the only sure way of getting
things right. Robust or fast, pick one. :)

Not fast enough for us.

I'm afraid you'll have to cope with different compilers' outputs. Even
if LLVM changes that, there will be others. You could say: "I only
care about GCC and LLVM", but you probably said before "I only care
about GCC", and it has proven problematic.

Another way would be to get all compilers to agree on a style, document
it, and follow it as if it were a "compiler standard". That doesn't
guarantee anything, but it at least provides a well-documented rationale,
with strong arguments, for why implementing A rather than B is optimal,
and might convince other compilers to abide by our decision.

We're going to discuss GCC + LLVM interactions at the GNU
Cauldron this Friday, I might add this topic to the list. I don't
particularly have any preference, but people might, so I'd be keen on
hearing the arguments on both sides.

cheers,
--renato

As has been mentioned several times (*), LLVM and GCC set up the frame pointer to
point to different stack slots on ARM. GCC's fp points to the stack slot holding
lr, while LLVM's fp points at the next slot.

This looks flipped from my tests. Both create an { fp, lr } struct;
GCC sets current fp to the address of lr in that struct; LLVM sets
current fp to the address of fp in that struct.
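
To make the two layouts concrete, here is a minimal sketch in C that models the { fp, lr } record on a simulated stack (a plain array, so it runs on any host rather than only on ARM). The names `frame_record`, `read_record_llvm_style` and `read_record_gcc_style` are invented for illustration; the only thing taken from the description above is where fp points, assuming the saved fp sits at the lower address and the saved lr one word above it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The pair both compilers push in the prologue (assumed layout):
   saved fp at the lower address, saved lr one word above it. */
typedef struct {
    uintptr_t saved_fp;
    uintptr_t saved_lr;
} frame_record;

/* LLVM style: fp holds the address of the saved-fp slot,
   i.e. the start of the record. */
frame_record read_record_llvm_style(const uintptr_t *fp) {
    assert(fp != NULL);
    frame_record r = { fp[0], fp[1] };
    return r;
}

/* ARM GCC style: fp holds the address of the saved-lr slot,
   one word into the record, so the saved fp is one word below. */
frame_record read_record_gcc_style(const uintptr_t *fp) {
    assert(fp != NULL);
    frame_record r = { fp[-1], fp[0] };
    return r;
}
```

The same two words are read in both cases; only the starting offset differs, which is why an unwinder that assumes the wrong style ends up treating a return address as a frame pointer or the other way round.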

Is this incompatibility intentional/desired, or could we somehow unify GCC
and LLVM in this regard?

What are the chances of getting GCC to change here? It's entirely a
bike-shedding argument, but there are a couple of reasons to prefer
LLVM's choice. It's most consistent with what *is* required in the
AArch64 ABI, and it means fp really points to the frame record, not
some random point half way through it.

Cheers.

Tim.

I'm not an expert in x86_64 asm, but it seems that both AArch64 and
x86_64 GCC do the same:

x86_64:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp

AArch64:
stp x29, x30, [sp, -32]!
add x29, sp, 0

which would indicate that LLVM's implementation on ARM is the most
consistent. I'm guessing ARM GCC's implementation was not an accident,
but a long-forgotten hack... :/

cheers,
--renato

As has been mentioned several times (*), LLVM and GCC set up the frame pointer to
point to different stack slots on ARM. GCC's fp points to the stack slot holding
lr, while LLVM's fp points at the next slot.

This looks flipped from my tests. Both create an { fp, lr } struct;
GCC sets current fp to the address of lr in that struct; LLVM sets
current fp to the address of fp in that struct.

Is this incompatibility intentional/desired, or could we somehow unify GCC
and LLVM in this regard?

What are the chances of getting GCC to change here? It's entirely a
bike-shedding argument, but there are a couple of reasons to prefer
LLVM's choice. It's most consistent with what *is* required in the
AArch64 ABI, and it means fp really points to the frame record, not
some random point half way through it.

It is also consistent with x86: we use exactly the same code to unwind
stack on both platforms.
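
For readers unfamiliar with frame-pointer unwinding, the loop in question looks roughly like the sketch below. It is not taken from any sanitizer or debugger source: `walk_frames`, `fp_style` and the bounds parameters are hypothetical names, and the stack is whatever memory range the caller passes in (a plain array works on any host). The point is that a single loop serves every target where fp points at the saved-fp slot, while ARM GCC's layout needs a one-word adjustment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where fp points inside the { fp, lr } record (hypothetical names). */
enum fp_style {
    FP_AT_SAVED_FP,  /* LLVM on ARM, AArch64, x86_64 */
    FP_AT_SAVED_LR   /* GCC on ARM */
};

/* Follow the chain of frame records starting at fp and collect return
   addresses into pcs. [lo, hi) is the valid stack range. Returns the
   number of return addresses stored. */
size_t walk_frames(uintptr_t fp, uintptr_t lo, uintptr_t hi,
                   enum fp_style style, uintptr_t *pcs, size_t max_pcs) {
    assert(pcs != NULL);
    const uintptr_t word = sizeof(uintptr_t);
    const uintptr_t back = (style == FP_AT_SAVED_LR) ? word : 0;
    size_t n = 0;

    while (n < max_pcs) {
        /* Reject misaligned or out-of-range frame pointers. */
        if (fp % word != 0 || fp < lo || fp >= hi || fp - lo < back)
            break;
        uintptr_t rec = fp - back;      /* address of the saved-fp slot */
        if (hi - rec < 2 * word)        /* record must fit below hi */
            break;

        const uintptr_t *slot = (const uintptr_t *)rec;
        uintptr_t next_fp = slot[0];
        pcs[n++] = slot[1];

        /* Callers live at higher addresses; anything else ends the walk. */
        if (next_fp <= fp)
            break;
        fp = next_fp;
    }
    return n;
}
```

The per-frame cost is two loads and a few comparisons, which is what makes this approach attractive compared with interpreting unwind tables.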

I'm checking with the ARM GCC folks if there's any reason behind this.
If it's just legacy, I'll propose a change on Friday (not holding my
breath, though).

cheers,
--renato

As has been mentioned several times (*), LLVM and GCC set up the frame pointer to
point to different stack slots on ARM. GCC's fp points to the stack slot holding
lr, while LLVM's fp points at the next slot.

This looks flipped from my tests. Both create an { fp, lr } struct;
GCC sets current fp to the address of lr in that struct; LLVM sets
current fp to the address of fp in that struct.

Right, I misread the assembly :(

Is this incompatibility intentional/desired, or could we somehow unify GCC
and LLVM in this regard?

What are the chances of getting GCC to change here?

Well, their logic is that as long as FP is not part of the ARM ABI they can make an arbitrary choice,
even if it complicates users' lives. I really hope that Renato can persuade people that
this is worth changing.

It's entirely a
bike-shedding argument, but there are a couple of reasons to prefer
LLVM's choice. It's most consistent with what *is* required in the
AArch64 ABI, and it means fp really points to the frame record, not
some random point half way through it.

Yeah, I think everyone agrees on this.

-Y

So, this is a lot more complicated than it seems and the choice was
not arbitrary.

The old APCS required the frame pointer to be pointing to LR in the
stack, and due to the number of problems that it created [1], AAPCS
said "we're having none of it". With that in mind, the GCC engineers
didn't change the FP logic when they implemented AAPCS. The AArch64
AAPCS had a better description of what to do with the FP, and since it
was a new target, both GCC and LLVM engineers decided to do what other
targets do instead.

As you may imagine, changing how the FP behaves will have an impact
not just in GCC itself, but many other tools (known and unknown) that
rely on that behaviour. So, while it's undecided and the change is
*possible*, it would need a strong argument to start that change.
Being "like the others" is not strong enough, and I agree with that.
Moreover, the AAPCS can theoretically change again, and enforce yet
another standard, where we'd have to change it all over.

For those reasons, changing ARM GCC's prologue/epilogue is probably
not happening soon.

As you probably already know, the reason why the AAPCS retreated from
controlling the FP is exactly the same as we're discussing it here.
People use it to unwind the stack. On the other hand, eliminating the
prologue when no local logic requires it is pointless and can be a big
difference in performance on devices that are already restricted by
extreme power constraints, so to produce really optimal code for ARM
you have to be able to change that.

What the AAPCS did was just to put on paper what was already true:
don't trust the prologue.

I know it's not the answer we wanted to hear, but it's a damn good
one, and one that I accept as the least costly solution. Given that
LLVM is *also* not breaking the AAPCS, I don't think it'd be a good
idea to replicate GCC's behaviour in the prologue for ARM just for the
sake of fast stack unwinding, but other people are free to disagree.

cheers,
-renato

I know it's not the answer we wanted to hear, but it's a damn good
one,

It's an answer. I wouldn't go any further than that myself.

Tim.

Maybe I didn't explain my position right. GCC folks are *definitely*
willing to change IFF there is a formal proposal from ARM. They also
agree that this is as bad as anything else when it comes to guessing
undocumented behaviour (but the formal reason is APCS), and they
*also* understand the headaches other people have with the
differences.

But changing this now will have repercussions across the toolchain and
other tools that rely on it, only for ARM to decide a year later to do
something else entirely. It's not worth the headache.

LLVM has a greater freedom to move and deprecate things, they don't. I
find it hard to see how this could be different.

--renato

As you may imagine, changing how the FP behaves will have an impact
not just in GCC itself, but many other tools (known and unknown) that
rely on that behaviour.

Note that these tools wouldn't work with Clang then.
And vice versa: tools that are developed in Clang (ASan) won't work with GCC.

On the other hand, eliminating the
prologue when no local logic requires it is pointless

I think you meant "keeping prologue when no local logic requires it is pointless" ?

and can be a big
difference in performance on devices that are already restricted by
extreme power constraints, so to produce really optimal code for ARM
you have to be able to change that.

It's the same for x64 - if you need the ability to do fast unwinding
you have to ask for it explicitly with -fno-omit-frame-pointer, otherwise the compiler
is free to re-use rbp for general computations.

-Y

Note that these tools wouldn't work with Clang then.
And vice versa: tools that are developed in Clang (ASan) won't work with
GCC.

That's the point. Breaking one to fix the other when there is no agreed
standard is not a good use of resources. Once there's an agreed
standard, we can all move to the same implementation.

I think you meant "keeping prologue when no local logic requires it is
pointless" ?

Yes, sorry.

I'll have to take that to a higher level, i.e. ARM, just like Jim was
doing with the assembly aliases in ARMCC's docs. It could take a
while...

--renato

Would they be willing to have a flag? Would we be willing to have a flag? Or should we conditionalize this on OS and say, on Linux, do the gcc thing, and on OS X, do the LLVM thing?

Would they be willing to have a flag? Would we be willing to have a flag?

That's a good question. Anything we do would be easier than waiting for
them to do anything, so if we decide to go with a flag, we should be
the ones implementing it.

Or should we conditionalize this on OS and say, on Linux, do the gcc thing,
and on OS X, do the LLVM thing?

I think you agree with me that both solutions are ugly, but I'd rather
not make this the default behaviour anywhere, so that only those who
need it (sanitizers) turn it on with a flag.

cheers,
--renato

Having a different code path for the prologue just for the sanitizers sounds pretty risky to me. That code is already strewn with conditional and modal stuff. Adding another variable to the permutations scares me. Is there really no alternative? Conditional code in the sanitizers that figures things out? LLDB and GDB have a very similar sort of problem for backtraces, including when debug info isn’t available. How do they solve it?

-Jim
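
One possible shape for such "figure it out" code is a heuristic that inspects the words around fp and guesses which slot it points at. The sketch below is purely illustrative and is not how any existing sanitizer or debugger is known to do it: `guess_fp_style` and its enum are hypothetical names, and the rule it applies (a saved fp must be a word-aligned stack address above the current fp, whereas a return address normally is not) is an assumption that can misfire on unusual frames.

```c
#include <assert.h>
#include <stdint.h>

enum fp_guess {
    FP_GUESS_UNKNOWN,
    FP_GUESS_SAVED_FP_SLOT,  /* fp points at the saved fp (LLVM style) */
    FP_GUESS_SAVED_LR_SLOT   /* fp points at the saved lr (ARM GCC style) */
};

/* True if v looks like a word-aligned stack address strictly above fp. */
static int looks_like_outer_fp(uintptr_t v, uintptr_t fp, uintptr_t hi) {
    return v > fp && v < hi && v % sizeof(uintptr_t) == 0;
}

/* Guess the frame layout from the contents of the stack range [lo, hi). */
enum fp_guess guess_fp_style(uintptr_t fp, uintptr_t lo, uintptr_t hi) {
    assert(lo < hi);
    const uintptr_t word = sizeof(uintptr_t);
    if (fp % word != 0 || fp < lo || fp >= hi || hi - fp < word)
        return FP_GUESS_UNKNOWN;

    const uintptr_t *p = (const uintptr_t *)fp;

    /* If the word at fp is itself a plausible outer fp, then fp most
       likely points at the saved-fp slot. */
    if (looks_like_outer_fp(p[0], fp, hi))
        return FP_GUESS_SAVED_FP_SLOT;

    /* Otherwise, if the word just below fp is a plausible outer fp,
       then fp most likely points at the saved-lr slot. */
    if (fp - lo >= word && looks_like_outer_fp(p[-1], fp, hi))
        return FP_GUESS_SAVED_LR_SLOT;

    return FP_GUESS_UNKNOWN;
}
```

A guess like this costs a couple of extra loads per frame, but it remains a guess: the outermost frame, leaf functions without a frame record, and hand-written assembly can all defeat it.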

Having a different code path for prologue just for the sanitizers sounds pretty risky to me. That code is already strewn with conditional and modal stuff. Adding another variable to the permutations scares me.

Same here.

Is there really no alternative? Conditional code in the sanitizers that figures things out? LLDB and GDB have a very similar sort of problem for backtraces, including when debug info isn’t available. How do they solve it?

The alternative is to use the unwind tables that both GCC and LLVM
generate even on C code, and that the ABI tells us to use, but their
argument is that's too slow. I don't know LLDB, but GDB uses the tables
as well as the hidden logic (for faster unwinding), so I guess that with
code produced by LLVM, it just uses the tables.

GDB has a lot of hidden context with GCC that only works because their
development roadmaps are tied together, and it's scarier than
that, but I don't know how they choose whether to use magic or not.

cheers,
--renato

It's not just sanitizers that need to be able to get fast, accurate stack
traces. Consider sampling profilers that capture call stacks. Using the
unwind tables is disruptively slow to the process under profile.

Why not do the unwind table parsing after the fact? Especially for a profiler, there’s no reason to do that during the actual profile collection.
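
A rough sketch of what "after the fact" could mean in practice: at sample time the profiler only copies the registers and a bounded chunk of the stack, and the table-driven unwind runs later against that snapshot. Everything here is hypothetical (`raw_sample`, `capture_sample`, `sample_read_word`, and the 256-byte limit are made up for illustration), and the offline unwinder that would actually interpret the tables is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SAMPLE_STACK_BYTES = 256 };  /* arbitrary cap for this sketch */

typedef struct {
    uintptr_t pc;      /* program counter at sample time */
    uintptr_t sp;      /* stack pointer at sample time */
    size_t    copied;  /* bytes of stack actually saved */
    unsigned char stack[SAMPLE_STACK_BYTES];
} raw_sample;

/* Sample-time work: one bounded memcpy, no table lookups. */
void capture_sample(raw_sample *out, uintptr_t pc, uintptr_t sp,
                    uintptr_t stack_hi) {
    assert(out != NULL && sp <= stack_hi);
    size_t avail = (size_t)(stack_hi - sp);
    out->pc = pc;
    out->sp = sp;
    out->copied = avail < SAMPLE_STACK_BYTES ? avail : SAMPLE_STACK_BYTES;
    memcpy(out->stack, (const void *)sp, out->copied);
}

/* Post-processing helper: fetch the word that lived at addr in the
   sampled process. Returns 1 on success, 0 if addr was not captured. */
int sample_read_word(const raw_sample *s, uintptr_t addr, uintptr_t *value) {
    if (addr < s->sp)
        return 0;
    size_t off = (size_t)(addr - s->sp);
    if (off > s->copied || s->copied - off < sizeof *value)
        return 0;
    memcpy(value, s->stack + off, sizeof *value);
    return 1;
}
```

The trade-off is memory and I/O: each sample carries a stack snapshot instead of a short list of return addresses, and frames deeper than the copied window cannot be recovered later.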