opt -std-compile-opts breaks tail calls

Is "opt" expected to honour tail calls?

I don't believe tail call elimination is on by default. Did you try
including the "-tailcallelim" option?

tail calls are only implemented for fastcall calling convention if i remeber right from my inquiries.

I've run into some issues with tail calls in the past, make sure you
are doing the following:

1. Call should be marked with tail (obviously)
2. Next statement after tail call should be 'return void'
3. Use fast call convention for tail calls
4. Make sure the function you are calling doesn't use the 'noreturn' attribute.
5. Turn on tail calls in llc (if using the static compiler), with
'-tailcallopt' option.

Point 4 is the one that caused me trouble for some time. Unfortunately
it causes a bad interaction with the optimiser, specifically the
'simplifycfg' pass. What seems to happen is that since the function
you are calling is marked with 'noreturn', the simplifycfg pass will
then put a 'unreachable' statement after the tail call to that
function. This stuffs up the tail call though as the llc compiler
requires tail calls be followed by a 'return void' statement. In my
case this caused my compiled programs to segfault if the llvm
optimiser was run on them.

I haven't fully investigated this yet so not really sure if it is
specific to my case or applies universally but it sure did cause me
some pains for a while.

Resent, to the list this time instead of David (sorry for the duplicate)...

Yes, that is correct. You may like to look at the implementation in HLVM which
handles struct return types as well:

  http://hlvm.forge.ocamlcore.org/

I've run into some issues with tail calls in the past, make sure you
are doing the following:

1. Call should be marked with tail (obviously)
2. Next statement after tail call should be 'return void'
3. Use fast call convention for tail calls
4. Make sure the function you are calling doesn't use the 'noreturn'
attribute. 5. Turn on tail calls in llc (if using the static compiler),
with
'-tailcallopt' option.

Point 4 is the one that caused me trouble for some time. Unfortunately
it causes a bad interaction with the optimiser, specifically the
'simplifycfg' pass. What seems to happen is that since the function
you are calling is marked with 'noreturn', the simplifycfg pass will
then put a 'unreachable' statement after the tail call to that
function. This stuffs up the tail call though as the llc compiler
requires tail calls be followed by a 'return void' statement. In my
case this caused my compiled programs to segfault if the llvm
optimiser was run on them.

Isn't that a bug in LLVM?

I haven't fully investigated this yet so not really sure if it is
specific to my case or applies universally but it sure did cause me
some pains for a while.

HLVM is certainly abiding by the requirements for tail calls or they wouldn't
be working with the JIT and unoptimized static compilation. The problem only
appears when the code is run through "opt". However, I am assuming that this
segmentation fault is the result of a broken tail call. Perhaps it is
something else...

Yes, please file a bugzilla with a small example.

-Chris

I think I may be having a similar problem, albeit with with noalias rather
than noreturn. This is my tail recursive function:

define fastcc noalias i8* @fold_aux(%2*, i32, i8* (%2*, %2, double)*, %2, %0)
{
entry:
  %5 = load i32* @shadow_stack_depth ; <i32> [#uses=4]
  %6 = alloca %2, align 8 ; <%2*> [#uses=2]
  %7 = load %0* @shadow_stack ; <%0> [#uses=1]
  %8 = extractvalue %0 %7, 2 ; <i8*> [#uses=1]
  %9 = bitcast i8* %8 to %0* ; <%0*> [#uses=1]
  %10 = sext i32 %5 to i64 ; <i64> [#uses=1]
  %11 = getelementptr %0* %9, i64 %10 ; <%0*> [#uses=1]
  store %0 %4, %0* %11
  %12 = add i32 %5, 1 ; <i32> [#uses=1]
  store i32 %12, i32* @shadow_stack_depth
  %13 = load i32* @n_allocated ; <i32> [#uses=1]
  %14 = load i32* @quota ; <i32> [#uses=1]
  %15 = icmp sgt i32 %13, %14 ; <i1> [#uses=1]
  br i1 %15, label %fail, label %cont

fail: ; preds = %entry
  %16 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]*
@buf16, i64 0, i64 0)) ; <i32> [#uses=0]
  %17 = load double ()** @hlvm_time ; <double ()*> [#uses=1]
  %18 = call double %17() ; <double> [#uses=1]
  %19 = call fastcc i8* @gc_mark(i32 0) ; <i8*> [#uses=0]
  %20 = load double* @mark_time ; <double> [#uses=1]
  %21 = load double ()** @hlvm_time ; <double ()*> [#uses=1]
  %22 = call double %21() ; <double> [#uses=1]
  %23 = fadd double %20, %22 ; <double> [#uses=1]
  %24 = fsub double %23, %18 ; <double> [#uses=1]
  store double %24, double* @mark_time
  %25 = load double ()** @hlvm_time ; <double ()*> [#uses=1]
  %26 = call double %25() ; <double> [#uses=1]
  %27 = call fastcc i8* @gc_sweep(i32 0) ; <i8*> [#uses=0]
  %28 = load double* @sweep_time ; <double> [#uses=1]
  %29 = load double ()** @hlvm_time ; <double ()*> [#uses=1]
  %30 = call double %29() ; <double> [#uses=1]
  %31 = fadd double %28, %30 ; <double> [#uses=1]
  %32 = fsub double %31, %26 ; <double> [#uses=1]
  store double %32, double* @sweep_time
  %33 = load i32* @n_allocated ; <i32> [#uses=1]
  %34 = shl i32 %33, 2 ; <i32> [#uses=1]
  %35 = add i32 %34, 256 ; <i32> [#uses=1]
  store i32 %35, i32* @quota
  br label %cont

cont: ; preds = %fail, %entry
  %36 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([10 x i8]*
@buf17, i64 0, i64 0)) ; <i32> [#uses=0]
  %37 = extractvalue %0 %4, 1 ; <i32> [#uses=1]
  %38 = icmp sgt i32 %37, %1 ; <i1> [#uses=1]
  br i1 %38, label %pass2, label %fail1

fail1: ; preds = %cont
  store i32 %5, i32* @shadow_stack_depth
  store %2 %3, %2* %0
  ret i8* undef

pass2: ; preds = %cont
  %39 = add i32 %1, 1 ; <i32> [#uses=1]
  %40 = icmp slt i32 %1, 0 ; <i1> [#uses=1]
  br i1 %40, label %fail6, label %pass4

pass4: ; preds = %pass2
  %41 = extractvalue %0 %4, 1 ; <i32> [#uses=1]
  %42 = icmp sgt i32 %41, %1 ; <i1> [#uses=1]
  br i1 %42, label %cont8, label %fail6

fail6: ; preds = %pass4, %pass2
  %43 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([27 x i8]*
@buf5, i64 0, i64 0)) ; <i32> [#uses=0]
  call void @exit(i32 1)
  br label %cont8

cont8: ; preds = %fail6, %pass4
  %44 = extractvalue %0 %4, 2 ; <i8*> [#uses=1]
  %45 = bitcast i8* %44 to double* ; <double*> [#uses=1]
  %46 = sext i32 %1 to i64 ; <i64> [#uses=1]
  %47 = getelementptr double* %45, i64 %46 ; <double*> [#uses=1]
  %48 = load double* %47 ; <double> [#uses=1]
  %49 = call fastcc i8* %2(%2* %6, %2 %3, double %48) ; <i8*> [#uses=0]
  %50 = load %2* %6 ; <%2> [#uses=1]
  store i32 %5, i32* @shadow_stack_depth
  %51 = tail call fastcc i8* @fold_aux(%2* %0, i32 %39, i8* (%2*, %2,
double)* %2, %2 %50, %0 %4) ; <i8*> [#uses=1]
  ret i8* %51
}

That last call is the critical tail call and AFAICT it is stack overflowing.
Anyone seen any problems like this before?

I'll try removing the noalias and seeing if that fixes it. If so, I'll file a
bug report. Which part of LLVM would this be a bug in if things like noalias
and noreturn inhibit TCO when they shouldn't?

CodeGen. I looked at noalias, and it was indeed a bug. This is now
fixed on trunk in r88672. Noreturn is a different issue.

Dan

Thanks: I tested it and removing noalias did indeed fix that stack overflow.
Unfortunately, the generated code is still broken in some other way that is
causing a segfault. I'll keep digging...