Question about fastcc assumptions and seemingly superfluous %esp updates

Hello,

While investigating one of the existing tests
(test/CodeGen/X86/tailcallpic2.ll), I ran into IR that produces some
interesting code. The IR is very straightforward:

define protected fastcc i32 @tailcallee(i32 %a1, i32 %a2, i32 %a3, i32 %a4) {
entry:
ret i32 %a3
}

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
entry:
%tmp11 = tail call fastcc i32 @tailcallee( i32 %in1, i32 %in2, i32
%in1, i32 %in2)
ret i32 %tmp11
}

define i32 @foo(i32 %in1, i32 %in2) {
entry:
  %q = call fastcc i32 @tailcaller(i32 %in2, i32 %in1)
  %ww = sub i32 %q, 6
  ret i32 %ww
}

Built with (ToT LLVM):
llc < ~/temp/z.ll -march=x86 -tailcallopt -O3

The produced code is (cleaned up a bit)

tailcallee: # @tailcallee
  movl 4(%esp), %eax
  ret $12

tailcaller: # @tailcaller
  subl $12, %esp
  movl %edx, 20(%esp)
  movl %ecx, 16(%esp)
  addl $12, %esp
  jmp tailcallee # TAILCALL

foo: # @foo
  subl $12, %esp
  movl 20(%esp), %ecx
  movl 16(%esp), %edx
  calll tailcaller
  subl $12, %esp
  addl $-6, %eax
  addl $12, %esp
  ret

A number of questions arise here:

1) Notice that 'tailcaller' goes beyond its own stack frame when
arranging arguments for 'tailcallee'. It subs 12 from %esp, but then
writes to 20(%esp). Clearly, something in the fastcc convention allows
it to assume that stack space will be available there? What is it?

2) Note the %esp dance 'tailcaller' is doing - completely useless sub
followed by add. Does this have an inherent goal or can it be
eliminated?

3) The %esp dance of 'foo' is even stranger:

  subl $12, %esp
  addl $-6, %eax
  addl $12, %esp

The subl and addl to %esp cancel out, and with an unrelated operation
in between. Why are they needed?

I'll be very grateful if someone could shed some light on this.

Eli

Hey Eli,

Hello,

While investigating one of the existing tests
(test/CodeGen/X86/tailcallpic2.ll), I ran into IR that produces some
interesting code. The IR is very straightforward:

define protected fastcc i32 @tailcallee(i32 %a1, i32 %a2, i32 %a3, i32 %a4) {
entry:
ret i32 %a3
}

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
entry:
%tmp11 = tail call fastcc i32 @tailcallee( i32 %in1, i32 %in2, i32
%in1, i32 %in2)
ret i32 %tmp11
}

define i32 @foo(i32 %in1, i32 %in2) {
entry:
%q = call fastcc i32 @tailcaller(i32 %in2, i32 %in1)
%ww = sub i32 %q, 6
ret i32 %ww
}

Built with (ToT LLVM):
llc < ~/temp/z.ll -march=x86 -tailcallopt -O3

The produced code is (cleaned up a bit)

tailcallee: # @tailcallee
movl 4(%esp), %eax
ret $12

tailcaller: # @tailcaller
subl $12, %esp
movl %edx, 20(%esp)
movl %ecx, 16(%esp)
addl $12, %esp
jmp tailcallee # TAILCALL

foo: # @foo
subl $12, %esp
movl 20(%esp), %ecx
movl 16(%esp), %edx
calll tailcaller
subl $12, %esp
addl $-6, %eax
addl $12, %esp
ret

A number of questions arise here:

  1. Notice that ‘tailcaller’ goes beyond its own stack frame when
    arranging arguments for ‘tailcallee’. It subs 12 from %esp, but then
    writes to 20(%esp). Clearly, something in the fastcc convention allows
    it to assume that stack space will be available there? What is it?

It looks like your call is being converted to a tailcall. I agree that those stack writes are setting up the arguments for tailcallee. Although, I haven’t done the stack frame math to say for sure.

I suspect that this is legal since tailcallee is a leaf function and the writes are into the “red zone”.

  1. Note the %esp dance ‘tailcaller’ is doing - completely useless sub
    followed by add. Does this have an inherent goal or can it be
    eliminated?

  2. The %esp dance of ‘foo’ is even stranger:

subl $12, %esp
addl $-6, %eax
addl $12, %esp

The subl and addl to %esp cancel out, and with an unrelated operation
in between. Why are they needed?

I’m not an expert in this area, but I believe that “ret $12” cleans up the stack by adding 12 bytes to %esp; an artifact of the tailcall conversion. So,

subl $12, %esp <= Matches the “ret $12” from tailcallee’s epilogue.
addl $-6, %eax
addl $12, %esp <= Matches the “subl $12, %esp” from foo’s prologue.

I suppose they’re explicitly needed in case a stack operation occurs after the call and before the return. I wonder if the spiller has not run yet when the tailcall decision is made, or something similar.

-Cameron

While investigating one of the existing tests
(test/CodeGen/X86/tailcallpic2.ll), I ran into IR that produces some
interesting code. The IR is very straightforward:

define protected fastcc i32 @tailcallee(i32 %a1, i32 %a2, i32 %a3, i32
%a4) {
entry:
ret i32 %a3
}

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
entry:
%tmp11 = tail call fastcc i32 @tailcallee( i32 %in1, i32 %in2, i32
%in1, i32 %in2)
ret i32 %tmp11
}

define i32 @foo(i32 %in1, i32 %in2) {
entry:
  %q = call fastcc i32 @tailcaller(i32 %in2, i32 %in1)
  %ww = sub i32 %q, 6
  ret i32 %ww
}

Built with (ToT LLVM):
llc < ~/temp/z.ll -march=x86 -tailcallopt -O3

The produced code is (cleaned up a bit)

tailcallee: # @tailcallee
  movl 4(%esp), %eax
  ret $12

tailcaller: # @tailcaller
  subl $12, %esp
  movl %edx, 20(%esp)
  movl %ecx, 16(%esp)
  addl $12, %esp
  jmp tailcallee # TAILCALL

foo: # @foo
  subl $12, %esp
  movl 20(%esp), %ecx
  movl 16(%esp), %edx
  calll tailcaller
  subl $12, %esp
  addl $-6, %eax
  addl $12, %esp
  ret

A number of questions arise here:

1) Notice that 'tailcaller' goes beyond its own stack frame when
arranging arguments for 'tailcallee'. It subs 12 from %esp, but then
writes to 20(%esp). Clearly, something in the fastcc convention allows
it to assume that stack space will be available there? What is it?

It looks like your call is being converted to a tailcall. I agree that those
stack writes are setting up the arguments for tailcallee. Although, I
haven't done the stack frame math to say for sure.

I suspect that this is legal since tailcallee is a leaf function and the
writes are into the "red zone".

Thanks for answering, Cameron.

I don't think this is red-zone related, because the (1) red-zone is in
the callee's, not caller's stack frame (i.e. it's *below* the return
address) and (2) red-zone is x86-64 specific and this code is
generated for 32-bit x86.

The math is pretty simple here. tailcaller gets two int arguments,
both passed on the stack (fastcc). So when it's entered there's only
the return address on stack. It subs 12 from the %esp but then writes
into 20(%esp), which is above the return address and hence in its
caller's frame.

2) Note the %esp dance 'tailcaller' is doing - completely useless sub
followed by add. Does this have an inherent goal or can it be
eliminated?

3) The %esp dance of 'foo' is even stranger:

  subl $12, %esp
  addl $-6, %eax
  addl $12, %esp

The subl and addl to %esp cancel out, and with an unrelated operation
in between. Why are they needed?

I'm not an expert in this area, but I believe that "ret $12" cleans up the
stack by adding 12 bytes to %esp; an artifact of the tailcall conversion.
So,

  subl $12, %esp <= Matches the "ret $12" from tailcallee's epilogue.
  addl $-6, %eax
  addl $12, %esp <= Matches the "subl $12, %esp" from foo's prologue.

I suppose they're explicitly needed in case a stack operation occurs after
the call and before the return. I wonder if the spiller has not run yet when
the tailcall decision is made, or something similar.

Yep, I agree about their purpose. It's just that they could (and
should) have been optimized away, I think.

Eli

When you enable -tailcallopt you get support for tail calls between functions with arbitrary stack space requirements. That means the calling convention has to change slightly. E.g the callee is responsible for removing it's arguments of the stack. The caller cannot transitively know the tail callee's tailcallee's requirement. Also care must be taken to make sure the stack stays aligned.

Hello,

While investigating one of the existing tests
(test/CodeGen/X86/tailcallpic2.ll), I ran into IR that produces some
interesting code. The IR is very straightforward:

define protected fastcc i32 @tailcallee(i32 %a1, i32 %a2, i32 %a3, i32 %a4) {
entry:
ret i32 %a3
}

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
entry:
%tmp11 = tail call fastcc i32 @tailcallee( i32 %in1, i32 %in2, i32
%in1, i32 %in2)
ret i32 %tmp11
}

define i32 @foo(i32 %in1, i32 %in2) {
entry:
%q = call fastcc i32 @tailcaller(i32 %in2, i32 %in1)
%ww = sub i32 %q, 6
ret i32 %ww
}

Built with (ToT LLVM):
llc < ~/temp/z.ll -march=x86 -tailcallopt -O3

The produced code is (cleaned up a bit)

tailcallee: # @tailcallee
movl 4(%esp), %eax
ret $12

tailcaller: # @tailcaller
subl $12, %esp
movl %edx, 20(%esp)
movl %ecx, 16(%esp)
addl $12, %esp
jmp tailcallee # TAILCALL

foo: # @foo
subl $12, %esp
movl 20(%esp), %ecx
movl 16(%esp), %edx
calll tailcaller
subl $12, %esp
addl $-6, %eax
addl $12, %esp
ret

A number of questions arise here:

1) Notice that 'tailcaller' goes beyond its own stack frame when
arranging arguments for 'tailcallee'. It subs 12 from %esp, but then
writes to 20(%esp). Clearly, something in the fastcc convention allows
it to assume that stack space will be available there? What is it?

It writes to its incoming parameter space. When you tail call your callers outgoing parameter area is your outgoing parameter area. Now you are going to say: Wait a minute the required space for tail caller is zero! And you are right, there is a bug in the code that computes the required space (which needs to be 16 byte aligned): see X86ISelLowering::GetAlignedArgumentStackSize

In general, the tail caller knows that there is space because it was called and its parameters were put there (plus some empty space to keep the stack 16byte aligned). To keep the stack aligned the parameter area changes in increments of 16. There was a bug (apparently, i did not upstream the fix for it :frowning: ) in the may we calculate this "adjustment" that would cause a 0 stack space to pump up the alignment by 12 (16 - return addr). It is safe (because we are calling this function consistently), though we are wasting stack space.

When we do the tail we ask: what is the required stack space of the caller and what is the required stack space of the callee. We subtract them. If the subtraction ends up to be zero you can just move the arguments, otherwise you have to adjust the stack(frame), possibly moving the return address around.

tailcaller f(i32,i32) -> bug: GetAlignedArgumentStackSize returns that there is space for three arguments on the stack (as you can see in foo this space was really allocated).

tailcallee f(i32,i32,i32,i32) -> also has space for three arguments on the stack

And there is another bug that causes the code to assume a tail call is a normal call and as such you end up with

tailcaller: # @tailcaller
subl $12, %esp
...
addl $12, %esp
jmp tailcallee

Again safe but not very efficient :).

2) Note the %esp dance 'tailcaller' is doing - completely useless sub
followed by add. Does this have an inherent goal or can it be
eliminated?

3) The %esp dance of 'foo' is even stranger:

subl $12, %esp
addl $-6, %eax
addl $12, %esp

Because of what I said in the beginning if a non fastcc function calls a fastcc function with tailcallopt on, it has do readjust the stack because the fastcc tail called function popped its arguments off the stack. Imagine you had two such call sites in a row.

The subl and addl to %esp cancel out, and with an unrelated operation
in between. Why are they needed?

I'll be very grateful if someone could shed some light on this.

Eli
_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

After rummaging around on one of my old machines I found this patch. I most have forgotten to commit this (this is from 2011 so I probably does not apply cleanly anymore).

tailcall_stack.patch (2.51 KB)

Hi Arnold,

Thanks for the insights. My comments below:

When you enable -tailcallopt you get support for tail calls between functions with arbitrary stack space requirements. That means the calling convention has to change slightly. E.g the callee is responsible for removing it's arguments of the stack. The caller cannot transitively know the tail callee's tailcallee's requirement. Also care must be taken to make sure the stack stays aligned.

Let me translate this to another phrasing to check my own
understanding: w.r.t. foo calling talcaller. Since tailcaller is
fastcc, when we generate code for it, we assume that the caller has
allocated the required stack space, so it's safe to write "into the
caller's frame". Even if it's not being "tail called" (like foo calls
it, a normal call) the caller still has to guarantee this invariant.
This is possible because when we generate the caller (foo) we know
that the call to tailcaller is fastcc. Is this right?

Hello,

While investigating one of the existing tests
(test/CodeGen/X86/tailcallpic2.ll), I ran into IR that produces some
interesting code. The IR is very straightforward:

define protected fastcc i32 @tailcallee(i32 %a1, i32 %a2, i32 %a3, i32 %a4) {
entry:
ret i32 %a3
}

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
entry:
%tmp11 = tail call fastcc i32 @tailcallee( i32 %in1, i32 %in2, i32
%in1, i32 %in2)
ret i32 %tmp11
}

define i32 @foo(i32 %in1, i32 %in2) {
entry:
%q = call fastcc i32 @tailcaller(i32 %in2, i32 %in1)
%ww = sub i32 %q, 6
ret i32 %ww
}

Built with (ToT LLVM):
llc < ~/temp/z.ll -march=x86 -tailcallopt -O3

The produced code is (cleaned up a bit)

tailcallee: # @tailcallee
movl 4(%esp), %eax
ret $12

tailcaller: # @tailcaller
subl $12, %esp
movl %edx, 20(%esp)
movl %ecx, 16(%esp)
addl $12, %esp
jmp tailcallee # TAILCALL

foo: # @foo
subl $12, %esp
movl 20(%esp), %ecx
movl 16(%esp), %edx
calll tailcaller
subl $12, %esp
addl $-6, %eax
addl $12, %esp
ret

A number of questions arise here:

1) Notice that 'tailcaller' goes beyond its own stack frame when
arranging arguments for 'tailcallee'. It subs 12 from %esp, but then
writes to 20(%esp). Clearly, something in the fastcc convention allows
it to assume that stack space will be available there? What is it?

It writes to its incoming parameter space. When you tail call your callers outgoing parameter area is your outgoing parameter area. Now you are going to say: Wait a minute the required space for tail caller is zero! And you are right, there is a bug in the code that computes the required space (which needs to be 16 byte aligned): see X86ISelLowering::GetAlignedArgumentStackSize

In general, the tail caller knows that there is space because it was called and its parameters were put there (plus some empty space to keep the stack 16byte aligned). To keep the stack aligned the parameter area changes in increments of 16. There was a bug (apparently, i did not upstream the fix for it :frowning: ) in the may we calculate this "adjustment" that would cause a 0 stack space to pump up the alignment by 12 (16 - return addr). It is safe (because we are calling this function consistently), though we are wasting stack space.

So the right thing would be:
foo knows that tailcaller only takes two ints, and because of fastcc
these go in registers so no stack arguments are required. So it would
not allocate space for arguments. Hence, tailcaller can no longer
assume these exist, and in order to build the arguments for tailcallee
would need to allocate some stack space on its own. Right?

When we do the tail we ask: what is the required stack space of the caller and what is the required stack space of the callee. We subtract them. If the subtraction ends up to be zero you can just move the arguments, otherwise you have to adjust the stack(frame), possibly moving the return address around.

tailcaller f(i32,i32) -> bug: GetAlignedArgumentStackSize returns that there is space for three arguments on the stack (as you can see in foo this space was really allocated).

tailcallee f(i32,i32,i32,i32) -> also has space for three arguments on the stack

And there is another bug that causes the code to assume a tail call is a normal call and as such you end up with

tailcaller: # @tailcaller
subl $12, %esp
...
addl $12, %esp
jmp tailcallee

Again safe but not very efficient :).

2) Note the %esp dance 'tailcaller' is doing - completely useless sub
followed by add. Does this have an inherent goal or can it be
eliminated?

3) The %esp dance of 'foo' is even stranger:

subl $12, %esp
addl $-6, %eax
addl $12, %esp

Because of what I said in the beginning if a non fastcc function calls a fastcc function with tailcallopt on, it has do readjust the stack because the fastcc tail called function popped its arguments off the stack. Imagine you had two such call sites in a row.

Yep, but certainly the above can be optimized away? What would be the
right stage to do it? Some peephole pass after the calls/prologs are
emitted?

After rummaging around on one of my old machines I found this patch. I most have forgotten to commit this (this is from 2011 so I probably does not apply cleanly anymore).

This is excellent. If you want to re-send the patch cleanly done vs.
ToT with a test or two, I will gladly review it.

Eli

Hi Arnold,

Thanks for the insights. My comments below:

When you enable -tailcallopt you get support for tail calls between functions with arbitrary stack space requirements. That means the calling convention has to change slightly. E.g the callee is responsible for removing it's arguments of the stack. The caller cannot transitively know the tail callee's tailcallee's requirement. Also care must be taken to make sure the stack stays aligned.

Let me translate this to another phrasing to check my own
understanding: w.r.t. foo calling talcaller. Since tailcaller is
fastcc, when we generate code for it, we assume that the caller has
allocated the required stack space, so it's safe to write "into the
caller's frame". Even if it's not being "tail called" (like foo calls
it, a normal call) the caller still has to guarantee this invariant.
This is possible because when we generate the caller (foo) we know
that the call to tailcaller is fastcc. Is this right?

Yes. As a side effect of this you cannot have some modules compiled with -tailcallopt and others not :(. This is implicit given by the fact that -tailcall opt changes the calling convention (http://llvm.org/docs/CodeGenerator.html#tail-call-optimization). Rereading this I think we should make this very important point explicit :).

Hello,

While investigating one of the existing tests
(test/CodeGen/X86/tailcallpic2.ll), I ran into IR that produces some
interesting code. The IR is very straightforward:

define protected fastcc i32 @tailcallee(i32 %a1, i32 %a2, i32 %a3, i32 %a4) {
entry:
ret i32 %a3
}

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
entry:
%tmp11 = tail call fastcc i32 @tailcallee( i32 %in1, i32 %in2, i32
%in1, i32 %in2)
ret i32 %tmp11
}

define i32 @foo(i32 %in1, i32 %in2) {
entry:
%q = call fastcc i32 @tailcaller(i32 %in2, i32 %in1)
%ww = sub i32 %q, 6
ret i32 %ww
}

Built with (ToT LLVM):
llc < ~/temp/z.ll -march=x86 -tailcallopt -O3

The produced code is (cleaned up a bit)

tailcallee: # @tailcallee
movl 4(%esp), %eax
ret $12

tailcaller: # @tailcaller
subl $12, %esp
movl %edx, 20(%esp)
movl %ecx, 16(%esp)
addl $12, %esp
jmp tailcallee # TAILCALL

foo: # @foo
subl $12, %esp
movl 20(%esp), %ecx
movl 16(%esp), %edx
calll tailcaller
subl $12, %esp
addl $-6, %eax
addl $12, %esp
ret

A number of questions arise here:

1) Notice that 'tailcaller' goes beyond its own stack frame when
arranging arguments for 'tailcallee'. It subs 12 from %esp, but then
writes to 20(%esp). Clearly, something in the fastcc convention allows
it to assume that stack space will be available there? What is it?

It writes to its incoming parameter space. When you tail call your callers outgoing parameter area is your outgoing parameter area. Now you are going to say: Wait a minute the required space for tail caller is zero! And you are right, there is a bug in the code that computes the required space (which needs to be 16 byte aligned): see X86ISelLowering::GetAlignedArgumentStackSize

In general, the tail caller knows that there is space because it was called and its parameters were put there (plus some empty space to keep the stack 16byte aligned). To keep the stack aligned the parameter area changes in increments of 16. There was a bug (apparently, i did not upstream the fix for it :frowning: ) in the may we calculate this "adjustment" that would cause a 0 stack space to pump up the alignment by 12 (16 - return addr). It is safe (because we are calling this function consistently), though we are wasting stack space.

So the right thing would be:
foo knows that tailcaller only takes two ints, and because of fastcc
these go in registers so no stack arguments are required. So it would
not allocate space for arguments. Hence, tailcaller can no longer
assume these exist, and in order to build the arguments for tailcallee
would need to allocate some stack space on its own. Right?

Yes. This will happen if the patch is applied.
You can also test this by having "tailcaller" call a function with say 6 parameters. If it does not happen - then we have a serious bug with -tailcallopt :).

When we do the tail we ask: what is the required stack space of the caller and what is the required stack space of the callee. We subtract them. If the subtraction ends up to be zero you can just move the arguments, otherwise you have to adjust the stack(frame), possibly moving the return address around.

tailcaller f(i32,i32) -> bug: GetAlignedArgumentStackSize returns that there is space for three arguments on the stack (as you can see in foo this space was really allocated).

tailcallee f(i32,i32,i32,i32) -> also has space for three arguments on the stack

And there is another bug that causes the code to assume a tail call is a normal call and as such you end up with

tailcaller: # @tailcaller
subl $12, %esp
...
addl $12, %esp
jmp tailcallee

Again safe but not very efficient :).

2) Note the %esp dance 'tailcaller' is doing - completely useless sub
followed by add. Does this have an inherent goal or can it be
eliminated?

3) The %esp dance of 'foo' is even stranger:

subl $12, %esp
addl $-6, %eax
addl $12, %esp

Because of what I said in the beginning if a non fastcc function calls a fastcc function with tailcallopt on, it has do readjust the stack because the fastcc tail called function popped its arguments off the stack. Imagine you had two such call sites in a row.

Yep, but certainly the above can be optimized away? What would be the
right stage to do it? Some peephole pass after the calls/prologs are
emitted?

I would have to dig into this part of the code again. But you could certainly do it in a post prolog-epilogue-insertion pass.

After rummaging around on one of my old machines I found this patch. I most have forgotten to commit this (this is from 2011 so I probably does not apply cleanly anymore).

This is excellent. If you want to re-send the patch cleanly done vs.
ToT with a test or two, I will gladly review it.

Sure will do.