A new builtin: __builtin_stack_pointer()

I originally posted this to the llvm-dev mailing list, when I should
have posted it here. So here it is reposted, and updated a bit.

One of the issues the LLVMLinux project is having is with the use of
named registers in the Linux kernel code. The kernel uses something like
this in order to assign a C variable name to a register (one for each
kernel arch).

    register unsigned long current_stack_pointer asm("esp");

clang doesn't allow this kind of thing, which required a patch that is less
efficient:

#define current_stack_pointer ({ \
       unsigned long esp; \
       asm("mov %%esp, %0" : "=r"(esp)); \
       esp; \
})

This works for both gcc and clang, but often adds 3 extra instructions:
the stack pointer has to be copied to another register, and that register
first has to be saved to the stack and then restored after the stack
pointer has been used.

Jakob Stoklund Olesen <stoklund@2pi.dk> suggested the following would be
better, and indeed it is.

#define current_stack_pointer ({ \
       register unsigned long esp asm("esp"); \
       asm("" : "=r"(esp)); \
       esp; \
})

Another way would be to introduce a new builtin in parallel to others
like __builtin_return_address(). Essentially adding a
__builtin_stack_pointer() which does what the above code does. The idea
would be to get something like this added to both clang and gcc in order
to make this work across compilers, and across arches.
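
As a rough sketch of how such a builtin could be adopted without breaking
existing compilers, the wrapper below prefers __builtin_stack_pointer() when
the compiler reports it, and otherwise falls back to the named-register idiom
shown above. The function name read_stack_pointer(), the __has_builtin probe
and the per-architecture register names are assumptions made for this
illustration; they are not taken from any patch in this thread.

```c
#include <stdint.h>

/* Older compilers do not provide __has_builtin; treat everything as absent. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical helper: returns the current value of the stack pointer
 * register, or the frame address as an approximation on unknown targets. */
static inline __attribute__((always_inline)) uintptr_t read_stack_pointer(void)
{
#if __has_builtin(__builtin_stack_pointer)
    /* The proposed builtin, used only if the compiler advertises it. */
    return (uintptr_t)__builtin_stack_pointer();
#elif defined(__x86_64__)
    register uintptr_t sp __asm__("rsp");
    __asm__("" : "=r"(sp));   /* empty asm: just binds sp to the register */
    return sp;
#elif defined(__i386__)
    register uintptr_t sp __asm__("esp");
    __asm__("" : "=r"(sp));
    return sp;
#elif defined(__arm__) || defined(__aarch64__)
    register uintptr_t sp __asm__("sp");
    __asm__("" : "=r"(sp));
    return sp;
#else
    /* Approximation only: the frame address, not the exact stack pointer. */
    return (uintptr_t)__builtin_frame_address(0);
#endif
}
```

A caller would simply write `uintptr_t sp = read_stack_pointer();`. The value
is a snapshot: nothing prevents the compiler from adjusting the stack pointer
afterwards.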

It ends up being a trivial patch for clang (see below). We're still
looking for someone to help us on the gcc side.

The goal is to ideally make the kernel code work equally well with both
compilers (clang and gcc).

Thoughts?

Thanks to Mark Charlebois for writing the patch below.

Behan

The LLVM LangRef is pretty clear:

llvm.stacksave:

"This intrinsic returns an *opaque pointer value* that can be passed to
llvm.stackrestore. When an llvm.stackrestore intrinsic is executed with a
value saved from llvm.stacksave, it effectively restores the state of the
stack to the state it was in when the llvm.stacksave intrinsic executed. In
practice, this pops any alloca blocks from the stack that were allocated
after the llvm.stacksave was executed."

An opaque pointer value doesn't sound like it's guaranteed to be a stack
pointer to me, so this patch isn't correct. __builtin_frame_address(0)
looks like it would give approximately what you want (though it's hard to
see how you could use this in correct code...).

You still haven't said what the intended behavior of this builtin is
when the function containing it is inlined.

Joerg

The intent in all situations is for it to use the current value in the
stack pointer register. It doesn't matter whether it is inlined or not.
It should be the immediate value of R13 on ARM or esp on x86 (rsp on
x86_64), and so on for other arches.

The idea is to preclude the need to use ASM to move the value in the
stack pointer to another register in order to use it in C code (which is
inefficient).

Essentially I'm looking for a better solution than adding a C symbol
name to the stack register.

Does that make sense?

Behan

Not yet. Why do you want this? What will you do with it? What semantics do
you want the returned pointer to have? How can correct code ever do
anything with this pointer? (Remember that the backend is in principle
allowed to modify the stack pointer at any time, including between the
point where you call this intrinsic and the point where you use its result.)

-Chris

It's used by the Linux kernel in several situations, but mostly for
threading, and various debug and stack tracing code.

Essentially the kernel code primarily uses the stack pointer by assigning
it a C symbol name like this.

    register unsigned long current_sp asm ("sp");

The value of that named register is read and stored elsewhere, or is used
to calculate the beginning or end of the stack in order to find the current
threadinfo or the current pt_regs (previous dump of the CPU registers).
Essentially in the case of ARM, r13 is used directly in the resulting code.

gcc allows you to do the above in order to access the value of the stack
pointer register (in the way I describe), however clang does not unless you
then add the following.

    register unsigned long current_sp asm ("sp");
    asm("" : "=r"(current_sp));

... which works, but is ugly. Since my goal is to be able to compile the
Linux kernel with both clang and gcc, in a way which is the most efficient
with each compiler, and which doesn't make the code worse, the above isn't
going to cut it.

Instead I'd like to see __builtin_stack_pointer() added to both clang and
gcc. It's easier to read, would provide read-only access to the register
value (which is safer), and mirrors the other __builtin functions (which in
the Linux kernel are often called together: __builtin_frame_address() and
__builtin_return_address()).

For instance:

    frame.fp = (unsigned long)__builtin_frame_address(0);
    frame.sp = current_sp;
    frame.lr = (unsigned long)__builtin_return_address(0);
    frame.pc = (unsigned long)return_address;

No doubt there is a more optimal way than the patch I sent.

Behan
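
To illustrate the "calculate the beginning or end of the stack" use, here is
a minimal sketch that assumes a fixed-size kernel stack aligned to its own
size, so that masking any address inside the stack yields its base. The
8 kB THREAD_SIZE value, the struct thread_info_sketch type and the helper
names are assumptions made for this illustration, not the kernel's actual
definitions.

```c
#include <stdint.h>

/* Assumption: the stack is 8 kB and aligned to 8 kB. */
#define THREAD_SIZE 8192UL

/* Hypothetical stand-in for the per-thread structure kept at the stack base. */
struct thread_info_sketch {
    int cpu;
};

/* Lowest address of the stack that contains sp. */
static inline uintptr_t stack_base(uintptr_t sp)
{
    return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}

/* One past the highest address of the stack that contains sp. */
static inline uintptr_t stack_top(uintptr_t sp)
{
    return stack_base(sp) + THREAD_SIZE;
}

/* The per-thread structure is assumed to live at the base of the stack. */
static inline struct thread_info_sketch *thread_info_from_sp(uintptr_t sp)
{
    return (struct thread_info_sketch *)stack_base(sp);
}
```

Because only the upper bits of the address matter, any address inside the
current stack gives the same result; an exact stack pointer value is not
needed for this particular computation.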

Okay, maybe not so trivial. :)

Yours however, is a much cleaner solution.

Thanks,

Behan

I don't think you've answered my main question: what semantics do you want
the returned pointer to have (and how can any correct code ever do anything
with the pointer)? If you just want some approximation of the stack pointer
to use as a key, __builtin_frame_address seems like it would work. Do you
*also* need a guarantee that the stack pointer will not change between the
call to __builtin_stack_address and the end of the function (except in
callees)? Can we reorder a call to alloca() past __builtin_stack_address?
Can we reorder the initialization of a VLA past a call to
__builtin_stack_address? Can the backend choose to perform a stack
adjustment afterwards?

> I don't think you've answered my main question: what semantics do you want
> the returned pointer to have (and how can any correct code ever do anything
> with the pointer)? If you just want some approximation of the stack pointer
> to use as a key, __builtin_frame_address seems like it would work.

linux can process IRQs on a separate kernel stack (instead of the normal
per-process kernel stacks) in order to reduce the chances of kernel stack
overflows (they're very limited in size). now linux also wants to be able
to walk across these stacks when producing a diagnostic backtrace (that may
very well happen while processing an IRQ on its own stack). different archs
solve this problem of finding one stack from another in different ways.

specifically the i386 kernel (arch/x86/kernel/irq_32.c) saves the 'current
stack pointer' value somewhere on the IRQ stack (before switching the esp
register, which is done in asm) and the backtrace code (that loop in dump_trace
in arch/x86/kernel/dumpstack_32.c) 'knows' where to find it and continues
the trace from this saved stack pointer value. all this logic probably breaks
the standard in a dozen ways so strictly speaking it's nowhere near compliant
but 'it works' (with gcc anyway ;). (as a sidenote, there's more stack walking
code in linux that assumes certain compiler behaviour but let's fix one problem
at a time)

now with that background let me try to answer your questions:

- __builtin_frame_address is indeed good enough for this purpose (and i can't
  find more use of the stack register in C, but maybe Behan knows of more where
  an exact value is important)

- the resulting value is expected to be a valid address (at the time it's taken)
  as the kernel *will* dereference it later but there're no other requirements.

> Do you *also* need a guarantee that the stack pointer will not change
> between the call to __builtin_stack_address and the end of the function
> (except in callees)?

no, it just has to be an address that belongs to the current stack frame at the
time. basically it'll define the start address from which the kernel will look
for certain things (such as code addresses in the hope that saved return addresses
will be found this way) on the stack.
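
A minimal sketch of that kind of scan, for illustration only: starting from
some address inside the stack, every word up to the end of the stack is
checked against the bounds of the code section, and the matches are reported
as candidate return addresses. The struct text_range type, the function names
and the bounds are hypothetical; this is not the kernel's dump_trace code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds of the code (text) section. */
struct text_range {
    uintptr_t start;
    uintptr_t end;   /* exclusive */
};

/* A word "looks like" a return address if it points into the text range. */
static int looks_like_code(const struct text_range *text, uintptr_t value)
{
    return value >= text->start && value < text->end;
}

/* Scan from start up to (not including) stack_end and copy every word that
 * looks like a code address into out. Returns the number of matches. */
static size_t scan_stack(const uintptr_t *start, const uintptr_t *stack_end,
                         const struct text_range *text,
                         uintptr_t *out, size_t max_out)
{
    size_t n = 0;

    for (const uintptr_t *p = start; p < stack_end && n < max_out; p++) {
        if (looks_like_code(text, *p))
            out[n++] = *p;
    }
    return n;
}
```

The start address only has to lie somewhere in the current frame: starting a
few words too early or too late changes how many stale or missing entries the
trace contains, but not whether the scan is safe.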

> Can we reorder a call to alloca() past __builtin_stack_address?
> Can we reorder the initialization of a VLA past a call to
> __builtin_stack_address? Can the backend choose to perform a stack
> adjustment afterwards?

yes for all.

cheers,
  PaX Team

> - __builtin_frame_address is indeed good enough for this purpose (and i can't
>   find more use of the stack register in C, but maybe Behan knows of more where
>   an exact value is important)

That seems like a good answer, if it works. It seems like we could choose
to copy everything on the current stack frame into some global storage and
back around any call to __builtin_stack_address, and thus one possible
correct implementation would be to always return the frame pointer.

> - the resulting value is expected to be a valid address (at the time it's taken)
>   as the kernel *will* dereference it later but there're no other requirements.
>
>> Do you *also* need a guarantee that the stack pointer will not change
>> between the call to __builtin_stack_address and the end of the function
>> (except in callees)?
>
> no, it just has to be an address that belongs to the current stack frame at the
> time. basically it'll define the start address from which the kernel will look
> for certain things (such as code addresses in the hope that saved return addresses
> will be found this way) on the stack.
>
>> Can we reorder a call to alloca() past __builtin_stack_address?
>> Can we reorder the initialization of a VLA past a call to
>> __builtin_stack_address? Can the backend choose to perform a stack
>> adjustment afterwards?
>
> yes for all.

Thanks. Seems like the kernel can rely on being able to read through
pointers that used to point to the stack because it knows that the readable
portion of the stack never shrinks, right? (This could go wrong for
programs using segmented stacks.)

> That seems like a good answer, if it works. It seems like we could choose
> to copy everything on the current stack frame into some global storage and
> back around any call to __builtin_stack_address, and thus one possible
> correct implementation would be to always return the frame pointer.

yes, the frame pointer works but please make sure that you can compute it
even with -fomit-frame-pointer (i.e., when there's no explicit hardware
register assigned for this purpose) as the i386 kernel is often compiled
with it (and which is why this manual stack walking code exists in the
first place, otherwise the stack walker 'knows' to follow ebp, the register
for the frame pointer).

> Thanks. Seems like the kernel can rely on being able to read through
> pointers that used to point to the stack because it knows that the readable
> portion of the stack never shrinks, right? (This could go wrong for
> programs using segmented stacks.)

exactly. in fact, the 'used to point to' part is technically not true
because for this i386 backtrace code the kernel knows that the 'other'
kernel stack is still valid and accessible in memory. this is because
at the time the backtrace code is called, the kernel knows that the
following events occurred:

1. cpu entered the kernel (syscall, exception, etc) and is running on
   the kernel stack assigned to the current process. on i386 it's a
   fixed size of 8kB (2 pages) and is also aligned to 8kB, its lifetime
   is that of the corresponding userland process.

2. an IRQ occurred and the kernel switched to an interrupt stack. at this
   point the calling context on the process' kernel stack is all valid
   and this is the time when the kernel needs 'something' to be able to
   find it later. the closer this something is to the last activated frame
   on the process' kernel stack, the better (more faithful) the backtrace
   will be later.

3. now that the cpu is on the interrupt stack, an unexpected event occurs,
   say an unserviceable page fault, or some debug facility detects something
   (lockdep violation, memory leak, whatnot). this is the time when the
   kernel's diagnostic logic wants to print a backtrace, which at this point
   includes the stack frames on both the interrupt stack *and* the process'
   kernel stack since they're all live. finding the stack frames on the
   interrupt stack is 'easy' but finding the other stack requires explicit
   management (the topic of this discussion).
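
The three steps above can be sketched with a toy model: each stack records
where the interrupted stack's pointer was at switch time, and a backtrace
that starts on the interrupt stack continues on the process stack from that
saved position. The struct stack_sketch layout, the tiny stack size and the
rule that any non-zero word counts as a trace entry are all assumptions made
for this illustration; the real i386 code stores the saved pointer
differently.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 8   /* tiny stacks, for illustration only */

/* Hypothetical model of a stack that remembers the stack it interrupted. */
struct stack_sketch {
    uintptr_t words[STACK_WORDS];           /* higher index = older frame */
    const struct stack_sketch *prev_stack;  /* interrupted stack, or NULL */
    size_t prev_sp_index;                   /* saved SP of prev_stack     */
};

/* Walk from sp_index to the end of the current stack, then continue on the
 * interrupted stack from its saved position. Non-zero words stand in for
 * return addresses. Returns the number of entries written to out. */
static size_t walk_stacks(const struct stack_sketch *s, size_t sp_index,
                          uintptr_t *out, size_t max_out)
{
    size_t n = 0;

    while (s != NULL) {
        for (size_t i = sp_index; i < STACK_WORDS && n < max_out; i++) {
            if (s->words[i] != 0)
                out[n++] = s->words[i];
        }
        sp_index = s->prev_sp_index;  /* step 2: value saved at switch time */
        s = s->prev_stack;
    }
    return n;
}
```

The closer the saved position is to the last active frame of the interrupted
stack, the fewer entries are missed, which is the property step 2 asks for.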

now all this is very linux (and kernel context) specific so i don't know if
you really need to worry about other use cases. of course there's the more
generic gcc feature of being able to assign other registers to variables,
not just the stack pointer. i don't know if clang/llvm wants to go there
as it then brings up the whole -ffixed-REG/-fcall-saved-REG/-fcall-used-REG
business as well ;).

cheers,
  PaX Team

We already have -ffixed-r9 for ARM for similar reasons, but I think we
would require an extensive discussion (like this one) to define if we want
to have those options, lots of built-in intrinsics, or both. But, as you
said, one thing at a time... ;)

I think there are clear reasons for Clang/LLVM to support the kernel, the
issue is just finding the best solution for each problem. Because the
kernel and GCC walked so closely for decades, some of the solutions were
taken not because they were the best, but because they were the simplest.
If we do the same now, we'll never get rid of bad code in both the kernel
and the compilers.

The GCC folks are also taking a similar approach nowadays, and it would be
good to know what they think about the problems we're solving here.

cheers,
--renato

> now with that background let me try to answer your questions:
>
> - __builtin_frame_address is indeed good enough for this purpose (and i can't
>   find more use of the stack register in C, but maybe Behan knows of more where
>   an exact value is important)

It is "good enough" in the case where you're merely trying to find the
beginning or end of the stack, however it's not acceptable to kernel
upstream (precisely for situations where there is no frame pointer
register). I've tried that already. They want to use the stack pointer.

In the other use case, where the stack pointer is saved for later it is
not good enough however.

> - the resulting value is expected to be a valid address (at the time it's taken)
>   as the kernel *will* dereference it later but there're no other requirements.

Exactly.

>> Do you *also* need a guarantee that the stack pointer will not change
>> between the call to __builtin_stack_address and the end of the function
>> (except in callees)?

Actually I'm proposing calling it __builtin_stack_pointer(). The reason
being that we're accessing the "esp" register on x86 and the "sp"
register on ARM. The register is generally known as the "stack pointer"
which seems the best thing to call it. (Principle of least surprise).

> no, it just has to be an address that belongs to the current stack frame at the
> time. basically it'll define the start address from which the kernel will look
> for certain things (such as code addresses in the hope that saved return addresses
> will be found this way) on the stack.

I haven't actually looked further into the stack walking code. All I
know is that the return address, frame address, and stack pointer are
all saved for later use in multiple places. I'm trying not to break how
it currently works with gcc, as well as have it work with clang using
the same code.

>> Can we reorder a call to alloca() past __builtin_stack_address?
>> Can we reorder the initialization of a VLA past a call to
>> __builtin_stack_address? Can the backend choose to perform a stack
>> adjustment afterwards?
>
> yes for all.

Agreed.

Incidentally, the LLVMLinux project is working on a ::stackpointer()
version of the __builtin_stack_pointer() patch (as suggested by Chris)
which we're just testing.

Thanks,

Behan

can you tell me where a precise stack pointer value is needed? the i386 stack
walker code definitely does not need one (sampling 'esp' is not precise anyway
since its value can change within a function) neither does current_thread_info.

cheers,
PaX Team

I have reimplemented __builtin_stack_pointer() on top of a new llvm::stackpointer intrinsic. I wasn't sure whether it was appropriate to post the code to this list and all the other recipients, but I wanted to know if this is an acceptable approach, so I am posting links to the patches below:

LLVM patch:

http://git.linuxfoundation.org/?p=llvmlinux.git;a=blob;f=toolchain/clang/patches/llvm/builtin_stack_pointer-new.patch;h=1b8a5d707975704d31485db4cc94fe0315884619;hb=HEAD

Clang patch:
http://git.linuxfoundation.org/?p=llvmlinux.git;a=blob;f=toolchain/clang/patches/clang/builtin_stack_pointer-new.patch;h=c0dbeea1564a3f4521dc79c083f456bf444a07f8;hb=HEAD

Please let me know the appropriate place to submit the patches or if they need to be further enhanced. I have a bad feeling the changes I added to LegalizeDAG.cpp are not needed, but I was basing my changes on ISD::RETURNADDR.

I have implemented llvm::stackpointer for ARM, AArch64 and X86.

I included tests for ARM and AArch64 but I have no idea how to write the test for X86.

Getting the stack pointer in an efficient way is not possible with clang today, but can be done with GCC. I hope this approach can facilitate getting the stack pointer efficiently via both compilers.

Thanks,

Mark

Hi Mark,

There are two ways:
1. Send the patches to the list, attached to your email. Copy llvm-commits
and cfe-commits, so that we know they're related and Clang folks can see
the LLVM side and vice versa.
http://llvm.org/docs/DeveloperPolicy.html#making-a-patch

2. Use Phabricator, add the main folks in the discussion to "reviewer"
roles, the others to CC and always remember to copy
cfe-commits/llvm-commits as appropriate. Emails will be sent automatically
and we can do review on a per-line basis.
http://llvm-reviews.chandlerc.com/

cheers,
--renato