Broke my tail (call)

I have written a variety of tail call tests for my HLVM, and all passed with
flying colors until I wrote this test (which is actually for algebraic
datatypes) and discovered that it segfaults after ~100k iterations through
what I think should be a tail call. Here's the IR:

define fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* }, i32) {
entry:
        %2 = alloca { i32, { { i8*, i8* }*, i8* } }     ; <{ i32, { { i8*, i8* }*, i8* } }*> [#uses=3]
        %3 = alloca { { i8*, i8* }*, i8* }              ; <{ { i8*, i8* }*, i8* }*> [#uses=3]
        br label %start

start:          ; preds = %entry
        %4 = getelementptr { i32, { { i8*, i8* }*, i8* } }* %2, i32 0, i32 0    ; <i32*> [#uses=1]
        store i32 %1, i32* %4
        %5 = getelementptr { i32, { { i8*, i8* }*, i8* } }* %2, i32 0, i32 1    ; <{ { i8*, i8* }*, i8* }*> [#uses=1]
        store { { i8*, i8* }*, i8* } %0, { { i8*, i8* }*, i8* }* %5
        %6 = getelementptr { i32, { { i8*, i8* }*, i8* } }* %2, i32 0   ; <{ i32, { { i8*, i8* }*, i8* } }*> [#uses=1]
        %7 = load { i32, { { i8*, i8* }*, i8* } }* %6   ; <{ i32, { { i8*, i8* }*, i8* } }> [#uses=1]
        %8 = malloc { i32, { { i8*, i8* }*, i8* } }     ; <{ i32, { { i8*, i8* }*, i8* } }*> [#uses=2]
        %9 = getelementptr { i32, { { i8*, i8* }*, i8* } }* %8, i32 0   ; <{ i32, { { i8*, i8* }*, i8* } }*> [#uses=1]
        store { i32, { { i8*, i8* }*, i8* } } %7, { i32, { { i8*, i8* }*, i8* } }* %9
        %10 = getelementptr { { i8*, i8* }*, i8* }* %3, i32 0, i32 0    ; <{ i8*, i8* }**> [#uses=1]
        store { i8*, i8* }* @Cons, { i8*, i8* }** %10
        %11 = bitcast { i32, { { i8*, i8* }*, i8* } }* %8 to i8*        ; <i8*> [#uses=1]
        %12 = getelementptr { { i8*, i8* }*, i8* }* %3, i32 0, i32 1    ; <i8**> [#uses=1]
        store i8* %11, i8** %12
        %13 = getelementptr { { i8*, i8* }*, i8* }* %3, i32 0   ; <{ { i8*, i8* }*, i8* }*> [#uses=1]
        %14 = load { { i8*, i8* }*, i8* }* %13          ; <{ { i8*, i8* }*, i8* }> [#uses=2]
        %15 = icmp eq i32 %1, 0                         ; <i1> [#uses=1]
        br i1 %15, label %pass, label %fail

fail:           ; preds = %start
        %16 = sub i32 %1, 1     ; <i32> [#uses=1]
        %17 = tail call fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* } %14, i32 %16)        ; <{ { i8*, i8* }*, i8* }> [#uses=1]
        ret { { i8*, i8* }*, i8* } %17

pass:           ; preds = %start
        ret { { i8*, i8* }*, i8* } %14
}

Am I going mad, or should that tail call a few lines up not be leaking stack
space?

The only possible explanation I can think of is that LLVM believes it cannot
make this a tail call because it thinks it is passing a pointer to a struct
that is local to the caller. Is that correct and, if so, how can I work
around it?

I just noticed that I am accidentally returning a struct rather than going via
a pointer in the first argument. Might that be related?

Hi Jon,

I have written a variety of tail call tests for my HLVM, and all passed with
flying colors until I wrote this test (which is actually for algebraic
datatypes) and discovered that it segfaults after ~100k iterations through
what I think should be a tail call. Here's the IR:

is this really a tail call? I didn't look closely but at a glance it seems to
be passing a local stack variable as a call parameter.

Ciao,

Duncan.

Hi Jon,

> I have written a variety of tail call tests for my HLVM, and all passed
> with flying colors until I wrote this test (which is actually for
> algebraic datatypes) and discovered that it segfaults after ~100k
> iterations through what I think should be a tail call. Here's the IR:

is this really a tail call?

From what I have understood of the LLVM docs about when tail calls get
eliminated on x86 and x64 it should be a tail call, yes.

  http://llvm.org/docs/CodeGenerator.html#tailcallopt

. Caller and callee have the calling convention fastcc.
. The call is a tail call - in tail position (ret immediately follows call and
ret uses value of call or is void).
. Option -tailcallopt is enabled.
. No variable argument lists are used.
. On x86-64 when generating GOT/PIC code only module-local calls (visibility =
hidden or protected) are supported.

Those are all satisfied.
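For reference, that checklist can be written down mechanically. This is a
hypothetical Python sketch, not LLVM code; the CallSite fields are invented
for illustration of the documented conditions:

```python
# Hypothetical sketch: encode the documented x86/x86-64 TCO conditions
# from CodeGenerator.html#tailcallopt as a checklist.
from dataclasses import dataclass

@dataclass
class CallSite:
    caller_cc: str          # calling convention of the caller
    callee_cc: str          # calling convention of the callee
    in_tail_position: bool  # ret immediately follows the call, using its value or void
    tailcallopt: bool       # -tailcallopt was passed to the code generator
    is_vararg: bool         # variable argument list in use
    pic: bool               # generating GOT/PIC code on x86-64
    module_local: bool      # visibility hidden/protected (module-local)

def eligible_for_tco(c: CallSite) -> bool:
    """True iff every documented condition for tail call optimization holds."""
    if c.caller_cc != "fastcc" or c.callee_cc != "fastcc":
        return False
    if not c.in_tail_position or not c.tailcallopt or c.is_vararg:
        return False
    if c.pic and not c.module_local:
        return False
    return True

call = CallSite("fastcc", "fastcc", True, True, False, False, True)
print(eligible_for_tco(call))  # every documented condition holds here
```

Note the checklist says nothing about where the argument values come from,
which is exactly the gap discussed below.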

I didn't look closely but at a glance it seems to be passing a local stack
variable as a call parameter.

In this case, the arguments are a { { i8*, i8* }*, i8* } and an i32. As I
understand it, first-class structs are simply unpacked for argument passing
so that is equivalent to passing { i8*, i8* }* and i8* and i32. In this case,
the first is a pointer to a global variable and the second is a pointer to a
malloc'd block. So I don't see why any of the arguments should be inhibiting
tail call elimination.

I just tested my theory that returning a first-class struct from a function
inhibits tail call elimination and it seems that I was correct: altering this
function to pass its return struct by pointer in the first argument fixes the
stack overflow.

Is this a bug in LLVM?

Hi Jon,

From what I have understood of the LLVM docs about when tail calls get
eliminated on x86 and x64 it should be a tail call, yes.

  http://llvm.org/docs/CodeGenerator.html#tailcallopt

. Caller and callee have the calling convention fastcc.
. The call is a tail call - in tail position (ret immediately follows call and
ret uses value of call or is void).
. Option -tailcallopt is enabled.
. No variable argument lists are used.
. On x86-64 when generating GOT/PIC code only module-local calls (visibility =
hidden or protected) are supported.

Those are all satisfied.

this list is for the code generator, and it seems obviously incomplete: it makes
no mention of local variables (alloca). Probably it is implicitly assuming that
the call was marked "tail call" by the LLVM optimizers. So you also need to check
under what conditions the LLVM optimizers do that.

> I didn't look closely but at a glance it seems to be passing a local stack
> variable as a call parameter.

In this case, the arguments are a { { i8*, i8* }*, i8* } and an i32.

Maybe, but I'm pretty sure at least one of these values was calculated by
mucking around with allocas. Did you add the "tail call" mark yourself?
If not, try removing it, and see if the LLVM optimizers add it back.

I just tested my theory that returning a first-class struct from a function
inhibits tail call elimination and it seems that I was correct: altering this
function to pass its return struct by pointer in the first argument fixes the
stack overflow.

Is this a bug in LLVM?

Could be, but first I'd like to be sure that you are not misusing tail calls.

Ciao,

Duncan.

Hello Duncan and Jon,

I am the criminal responsible for the tail call implementation in the backends.

Hi Jon,

From what I have understood of the LLVM docs about when tail calls get
eliminated on x86 and x64 it should be a tail call, yes.

See below.

this list is for the code generator, and it seems obviously incomplete: it makes
no mention of local variables (alloca). Probably it is implicitly assuming that
the call was marked "tail call" by the LLVM optimizers. So you also need to check
under what conditions the LLVM optimizers do that.

The backend does not check the following condition but assumes it to be true.

From <http://llvm.org/docs/LangRef.html#i_call>:

The optional "tail" marker indicates whether the callee function
accesses any allocas or varargs in the caller. If the "tail" marker is
present, the function call is eligible for tail call optimization.
Note that calls may be marked "tail" even if they do not occur before
a ret instruction.

If the llvm IR is generated by a front end (HLVM) the front end must
guarantee this condition.
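That guarantee could be sketched as a conservative front-end check. This is
hypothetical Python over an invented toy IR model (a real front end would
walk actual IR), refusing the "tail" marker when any pointer argument traces
back to an alloca in the caller:

```python
# Hypothetical sketch of the conservative check a front end might run
# before emitting the "tail" marker. `defs` maps a value name to the
# (opcode, operand) that defined it; the IR model is invented.
def traces_to_alloca(value, defs):
    """Follow a value through address-preserving defs back to its root."""
    while value in defs:
        op, operand = defs[value]
        if op == "alloca":
            return True              # points at the caller's stack frame
        if op in ("getelementptr", "bitcast"):
            value = operand          # these re-point at the same storage
        else:
            return False             # load, call, ... produce a fresh value
    return False

def may_mark_tail(args, defs):
    """Only mark a call "tail" if no argument is a pointer into an alloca."""
    return not any(traces_to_alloca(a, defs) for a in args)

# %3 = alloca; %13 = getelementptr %3 -> passing %13 must NOT be marked tail
defs = {"%3": ("alloca", None), "%13": ("getelementptr", "%3")}
print(may_mark_tail(["%13"], defs))   # False
# %14 = load %13 yields a fresh first-class value, not the storage itself
defs["%14"] = ("load", "%13")
print(may_mark_tail(["%14"], defs))   # True
```

The second case mirrors Jon's IR: %14 is a loaded struct value, which is why
the alloca turns out not to be the real problem later in the thread.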

> I didn't look closely but at a glance it seems to be passing a local stack
> variable as a call parameter.

To me, also only glancing over the code and with a bit rusty LLVM IR
knowledge, it seems that %14 (the value passed to the tail call) is
allocated in the stack frame (alloca) of the calling function. So the
above condition is violated.

%3 = alloca { { i8*, i8* }*, i8* }
%13 = getelementptr { { i8*, i8* }*, i8* }* %3, i32 0
%14 = load { { i8*, i8* }*, i8* }* %13
%17 = tail call fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* } %14, i32 %16)

Is this a bug in LLVM?

Could be, but first I'd like to be sure that you are not misusing tail calls.

Assuming that I am right and a pointer to alloca'ed memory is passed,
this is not a bug but a misuse of the 'tail' marker.

Maybe we should add a link or a note on
<http://llvm.org/docs/CodeGenerator.html#tailcallopt> to
<http://llvm.org/docs/LangRef.html#i_call> that explains that the call
can only be marked as 'tail' if no pointers to alloca'ed memory are
passed.

regards arnold

Hello Duncan and Jon,

I am the criminal responsible for the tail call implementation in the
backends.

Thanks! :-)

> Hi Jon,
>
>> From what I have understood of the LLVM docs about when tail calls get
>> eliminated on x86 and x64 it should be a tail call, yes.

See below.

> this list is for the code generator, and it seems obviously incomplete:
> it makes no mention of local variables (alloca). Probably it is
> implicitly assuming that the call was marked "tail call" by the LLVM
> optimizers. So you also need to check under what conditions the LLVM
> optimizers do that.

The backend does not check the following condition but assumes it to be
true.

From <http://llvm.org/docs/LangRef.html#i_call>:

The optional "tail" marker indicates whether the callee function
accesses any allocas or varargs in the caller. If the "tail" marker is
present, the function call is eligible for tail call optimization.
Note that calls may be marked "tail" even if they do not occur before
a ret instruction.

If the llvm IR is generated by a front end (HLVM) the front end must
guarantee this condition.

Right, that is in agreement with my observations:

1. Blindly setting "tail" everywhere in HLVM breaks some of LLVM's
optimization passes (which blindly assume it to be correct) and culminates in
the generation of broken code.

2. Passing a reference to a struct local to the current stack frame through a
tail call is *not* detected as an error by LLVM and produces
code that corrupts data by overwriting the memory locations of the struct
from the caller with whatever the callee uses the same stack frame for.

>> > I didn't look closely but at a glance it seems to be passing a local
>> > stack variable as a call parameter.

To me, also only glancing over the code and with a bit rusty LLVM IR
knowledge, it seems that %14 (the value passed to the tail call) is
allocated in the stack frame (alloca) of the calling function. So the
above condition is violated.

%3 = alloca { { i8*, i8* }*, i8* }
%13 = getelementptr { { i8*, i8* }*, i8* }* %3, i32 0
%14 = load { { i8*, i8* }*, i8* }* %13
%17 = tail call fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* } %14, i32 %16)

I do not understand. Surely you should be testing for structs locally
allocated on the caller's frame being passed *by reference* to the callee? In
this case, the struct is passed by value so the fact that it came from the
stack should be irrelevant: that stack space is not being depended upon
because the fields of the struct are now in registers.

This is all hinging on my interpretation of first-class structs, though. I am
under the impression that they are essentially a set of registers holding
their first-class fields. So you can pass them by value as arguments to
functions with the behaviour of any other first-class type regardless of
where they came from.

>> Is this a bug in LLVM?
>
> Could be, but first I'd like to be sure that you are not misusing tail
> calls.

Assuming that I am right and a pointer to alloca'ed memory is passed,
this is not a bug but a misuse of the 'tail' marker.

I disagree. I think the fields of the struct should be passed by value with no
pointer to the original alloca'd data.

Moreover, I now have evidence that LLVM is not behaving as you expect:

3. Adjusting the return value from this function into sret form results in
tail call elimination being performed correctly. Note that this is still
passing a first-class struct by value as an argument to a function through a
tail call.

Here is my code that does get tail call eliminated correctly:

define fastcc i32 @init({ i8*, i8* }*, { i8*, i8* }, i32) {
entry:
  %3 = alloca { i32, { i8*, i8* } }     ; <{ i32, { i8*, i8* } }*> [#uses=3]
  %4 = alloca { i8*, i8* }              ; <{ i8*, i8* }*> [#uses=3]
  br label %start

start:          ; preds = %entry
  %5 = bitcast { i32, { i8*, i8* } }* %3 to i32*        ; <i32*> [#uses=1]
  store i32 %2, i32* %5
  %6 = getelementptr { i32, { i8*, i8* } }* %3, i32 0, i32 1    ; <{ i8*, i8* }*> [#uses=1]
  store { i8*, i8* } %1, { i8*, i8* }* %6
  %7 = bitcast { i32, { i8*, i8* } }* %3 to { i32, { i8*, i8* } }*      ; <{ i32, { i8*, i8* } }*> [#uses=1]
  %8 = load { i32, { i8*, i8* } }* %7   ; <{ i32, { i8*, i8* } }> [#uses=1]
  %9 = malloc { i32, { i8*, i8* } }     ; <{ i32, { i8*, i8* } }*> [#uses=2]
  %10 = bitcast { i32, { i8*, i8* } }* %9 to { i32, { i8*, i8* } }*     ; <{ i32, { i8*, i8* } }*> [#uses=1]
  store { i32, { i8*, i8* } } %8, { i32, { i8*, i8* } }* %10
  %11 = bitcast { i32, { i8*, i8* } }* %9 to i8*        ; <i8*> [#uses=1]
  %12 = bitcast { i8*, i8* }* %4 to i8**        ; <i8**> [#uses=1]
  store i8* bitcast ({ i8*, i32 ({ i8*, i8* })* }* @Cons to i8*), i8** %12
  %13 = getelementptr { i8*, i8* }* %4, i32 0, i32 1    ; <i8**> [#uses=1]
  store i8* %11, i8** %13
  %14 = bitcast { i8*, i8* }* %4 to { i8*, i8* }*       ; <{ i8*, i8* }*> [#uses=1]
  %15 = load { i8*, i8* }* %14          ; <{ i8*, i8* }> [#uses=2]
  %16 = icmp eq i32 %2, 0               ; <i1> [#uses=1]
  br i1 %16, label %pass, label %fail

fail:           ; preds = %start
  %17 = sub i32 %2, 1   ; <i32> [#uses=1]
  %18 = tail call fastcc i32 @init({ i8*, i8* }* %0, { i8*, i8* } %15, i32 %17)         ; <i32> [#uses=1]
  ret i32 %18

pass:           ; preds = %start
  %19 = bitcast { i8*, i8* }* %0 to { i8*, i8* }*       ; <{ i8*, i8* }*> [#uses=1]
  store { i8*, i8* } %15, { i8*, i8* }* %19
  ret i32 0
}

Maybe we should add a link or a note on
<http://llvm.org/docs/CodeGenerator.html#tailcallopt> to
<http://llvm.org/docs/LangRef.html#i_call> that explains that the call
can only be marked as 'tail' if no pointers to alloca'ed memory are
passed.

The documentation should probably be clarified but I am leaning toward this
being a bug in LLVM itself: tail call elimination should be done when the
return value is a first-class struct.

Also, I said before that I had found first-class structs to be broken. In
retrospect, I think I was probably observing this phenomenon and seeing a
segmentation fault due to stack overflow caused by broken tail calls and not
broken structs.

This is all hinging on my interpretation of first-class structs, though. I am
under the impression that they are essentially a set of registers holding
their first-class fields. So you can pass them by value as arguments to
functions with the behaviour of any other first-class type regardless of
where they came from.

This is not true in general and is highly target- and CC-dependent. For
example, you can run out of registers, and then your struct will be passed
partly in registers and partly on the stack. And depending on the stack
frame size of the callee, you can easily get infinite stack growth.

As for returning a struct by value - I think the situation is more
complicated, and you can expect only "small structs" to behave this way
(small = fits in a single target register).

Yes, I was wrong - the problem is of a different kind and has nothing to
do with the alloca. As you say, the struct is passed by value. This is
not the problem. For example, the following code (not containing any
allocas) will also not be tail call optimized.

define fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* }, i32) {
entry:
        %2 = tail call fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* } %0, i32 %1)
        ret { { i8*, i8* }*, i8* } %2
}

The problem has to do with how struct returns are represented
internally by llvm (in the SelectionDAG) and how the tail call
optimization implementation checks if it may perform the tail call.
The implementation checks that the <call> node is immediately followed
by a <ret> node. A struct return causes the insertion of a
<merge_values> node between the tail call instruction and the return
instruction node. Because of the intermediate node the backend
believes it must not optimize the tail call.

[result1, result2] = <call ... >
[merged] = <merge_values [result1, result2]>
<ret [merged]>

So the current situation is that tail calls on functions that return a
struct are not optimized (as you correctly observed ;). Sorry.
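The pattern check described above can be sketched in a few lines. This is a
hypothetical Python model; the real check operates on SelectionDAG nodes, not
strings:

```python
# Hypothetical sketch of the backend's check: TCO is performed only when
# the <ret> node immediately follows the <call> node. A struct return
# inserts a <merge_values> node in between, so the check fails.
def tco_performed(dag):
    """dag: the sequence of node kinds ending the function."""
    for i, node in enumerate(dag):
        if node == "call" and i + 1 < len(dag):
            return dag[i + 1] == "ret"
    return False

print(tco_performed(["call", "ret"]))                    # scalar return: True
print(tco_performed(["call", "merge_values", "ret"]))    # struct return: False
```

This makes the observed behaviour mechanical: nothing about struct returns is
semantically incompatible with TCO; the adjacency test is just too strict.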

regards arnold

There is a sense in which it is true -- first-class structs are converted
to a set of *virtual* registers holding their first-class fields, which may
of course be passed on the stack when physical registers run short,
in exactly the same manner as with lots of scalar arguments.

Dan

Exactly.

Thanks for the clarification. That makes a lot more sense!

LLVM's support for structs is wonderful but I don't think they can be
called "first-class structs" until all such arbitrary restrictions have been
removed, even though the workaround (using sret form) is trivial in this
case.

Shall I file this as a bug against LLVM?

Assuming this gets fixed, will it be more efficient than the sret form?

In the case of tailcall (+fastcc?), can't this be done in constant stack
space? The first time, you allocate stack space to pass the parameters;
then every subsequent tail call reuses the stack space that it got its
parameters in to pass parameters to the next tail call.
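What Edwin describes is essentially the loop form of the call. A minimal
Python sketch of the idea (hypothetical, not generated code - a real backend
does this with a jump that overwrites the incoming argument slots):

```python
# Sketch of constant-stack tail recursion: the tail call becomes a jump
# back to the top, and the parameter slots (env, n) are overwritten in
# place on every iteration instead of pushing a new frame.
def init(env, n):
    while True:                 # the "jmp" that replaces the tail call
        if n == 0:
            return env
        env = ("Cons", env)     # build the new first argument in place
        n = n - 1               # overwrite the counter slot and loop

# The ~100k iterations that crashed the original test are no problem here.
print(init((), 100_000)[0])  # prints "Cons"; stack depth stays constant
```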

Best regards
--Edwin

Hi,

Does anyone know an easy way to do the same optimisations as 'opt -O2' or 'opt -O3' in the JIT?

I can do the tedious way:

  pass_manager->add(createInstructionCombiningPass());
  pass_manager->add(createGVNPass());
  etc.
  etc.

but I would rather reuse the code in 'opt' to manage the passes.

Cheers,
Mark.

Yes and indeed it already does.

define fastcc void @init2({ i32, i32, i32, i32}, i32) {
entry:
       tail call fastcc void @init2({ i32, i32, i32, i32} %0, i32 %1)
       ret void
}

results in a tail call version (admittedly a little too eager to move
arguments around):

Llabel1:
        subl $12, %esp
        movl 24(%esp), %eax
        movl %eax, 24(%esp)
        movl 20(%esp), %eax
        movl %eax, 20(%esp)
        movl 16(%esp), %eax
        movl %eax, 16(%esp)
        addl $12, %esp
        jmp _init2 # TAILCALL
Leh_func_end1:

The problem Jon is seeing has to do with the struct return, not the
struct argument passing, as I wrongly assumed at first.

regards arnold

Thanks for the clarification. That makes a lot more sense!

LLVM's support for structs is wonderful but I don't think they can be
called "first-class structs" until all such arbitrary restrictions have been
removed, even though the workaround (using sret form) is trivial in this
case.

Shall I file this as a bug against LLVM?

Yes, please do, and maybe include the following code, which already
exhibits the error, in the bug report.

define fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* }, i32) {
entry:
      %2 = tail call fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* } %0, i32 %1)
      ret { { i8*, i8* }*, i8* } %2
}

Assuming this gets fixed, will it be more efficient than the sret form?

Don't expect this to be fixed soon, as I am currently on a leave of
absence from LLVM, busy with bringing tail calls to another VM and
writing a thesis about it ;).

Whether it will be more efficient I can't answer offhand. Sorry. But
probably not, because the generated code should be quite similar.

For the sret version the move of the result is performed before the return.
  store { i8*, i8* } %15, { i8*, i8* }* %19
  ret i32 0

For the struct return version this move would be performed as part of
moving the result from the result registers to whatever virtual
register is expecting the result. If the register allocator decides to
merge the virtual register with the result registers, then no further
move is needed.
%2 = tail call fastcc { { i8*, i8* }*, i8* } @init({ { i8*, i8* }*, i8* } %0, i32 %1)

Note that if you have a series of sequential recursive tail calls, this
move will only be performed once (at the bottom of the recursion,
respectively when the recursion returns), so its impact on performance
should be minimal.

regards
arnold

Hmm, that makes it sound as though the moves between a tail call and the
following return are redundant?

No, but a patch to factor out the relevant code from opt.cpp so
that it can be used like this would be welcome. The code probably
has to go in an inline function in a header file to avoid library
dependency issues, but otherwise it should be straightforward.

Dan

So the current situation is that tail calls on functions that return a
struct are not optimized (as you correctly observed ;). Sorry.

Thanks for the clarification. That makes a lot more sense!

LLVM's support for structs is wonderful but I don't think they can be
called "first-class structs" until all such arbitrary restrictions have been
removed, even though the workaround (using sret form) is trivial in this
case.

"first-class" in this context means "Values can have this type", not
"everything that anyone might want is supported in all backends."

A meta issue is that LLVM IR differs from many virtual machine ISAs in
that it's more of a language for expressing programs than a specification
of what some particular machine can actually execute. When a
particular LLVM component doesn't support a particular aspect of the
language, it's considered a bug in the component, and not necessarily
considered a problem in the language.

This arrangement is convenient for people developing LLVM itself,
because new language features can be added without the requirement
that all backends and execution engines support all features in full
generality, which isn't always immediately useful, and sometimes not
even meaningful. However, it's inconvenient for people using LLVM,
because there's no complete list of which features are actually
supported on any given backend. Some features that are commonly
unsupported are now mentioned in LangRef.html, though it's certainly
not complete.

One way this issue might be addressed is with an LLVM IR
conformance test suite. Covering the whole language would be a
big project, but the work could be done incrementally. I wonder if
people would be interested in working on this.

Shall I file this as a bug against LLVM?

Yes, thanks.

Dan


Note that if you have a series of sequential recursive tail calls, this
move will only be performed once (at the bottom of the recursion,
respectively when the recursion returns), so its impact on performance
should be minimal.

Hmm, that makes it sound as though the moves between a tail call and the
following return are redundant?

I am not sure I understand what you mean by redundant.

What i was trying to say is that if you have

i32 a() {
  %1 = tailcall b()
  ret %1
}

i32 b() {
  %1 = tailcall c()
  ret %1
}

i32 c() {
  %1 = tailcall d()
  ret %1
}

i32 d() {
  ret i32 5
}

only d() will actually perform the return, i.e. the move of the result
to register %eax on x86, or in the case of a struct return the move to
whatever registers/(stack slots?) are used to return the elements of
the struct.
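A hedged sketch of that behaviour (hypothetical Python trampoline; the real
mechanism is direct jumps, not returned closures, but the counting is the
same):

```python
# Sketch: in a chain of eliminated tail calls a() -> b() -> c() -> d(),
# the result is materialised exactly once, at the bottom of the chain.
# We count "moves into the result register" with a counter.
moves = 0

def trampoline(f):
    """Run a chain of tail calls iteratively, as TCO effectively does."""
    while callable(f):
        f = f()
    return f

def a(): return b          # tail call b()
def b(): return c          # tail call c()
def c(): return d          # tail call d()
def d():                   # the only function that actually returns a value
    global moves
    moves += 1             # the single move of the result
    return 5

print(trampoline(a), moves)  # 5 1
```

The intermediate frames never touch the result, which is why the move between
the tail call and the following ret costs almost nothing in practice.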

regards
arnold