RFC: Tail call optimization X86

The patch is against revision 42247.

tailcall-src.patch (61.2 KB)

Hi Arnold,

This is a very good first step! Thanks! Comments below.

Evan

Index: test/CodeGen/X86/constant-pool-remat-0.ll

+; RUN: llvm-as < %s | llc -march=x86 -mattr=+sse2 -stats -info-
output-file - | grep asm-printer | grep 9
+; change preceeding line form ... | grep 8 to ..| grep 9 since
+; with new fastcc has std call semantics causing a stack adjustment
+; after the function call

Not sure if I understand this. Can you illustrate with an example?

Sure

The code generated used to be
_array:
         subl $12, %esp
         movss LCPI1_0, %xmm0
         mulss 16(%esp), %xmm0
         movss %xmm0, (%esp)
         call L_qux$stub
         mulss LCPI1_0, %xmm0
         addl $12, %esp
         ret

FastCC use to be caller pops arguments so there was no stack adjustment after the
call to qux. Now FastCC has callee pops arguments on return semantics so the
x86 backend inserts a stack adjustment after the call.

_array:
    subl $12, %esp
         movss LCPI1_0, %xmm0
         mulss 16(%esp), %xmm0
         movss %xmm0, (%esp)
         call L_qux$stub
         subl $4, %esp << stack adjustment because qux pops arguments on return
         mulss LCPI1_0, %xmm0
         addl $12, %esp
         ret $4

Index: include/llvm/Target/TargetLowering.h

--- include/llvm/Target/TargetLowering.h (revision 42247)
+++ include/llvm/Target/TargetLowering.h (working copy)
@@ -851,8 +851,18 @@
    virtual std::pair<SDOperand, SDOperand>
    LowerCallTo(SDOperand Chain, const Type *RetTy, bool RetTyIsSigned,
                bool isVarArg, unsigned CallingConv, bool isTailCall,
- SDOperand Callee, ArgListTy &Args, SelectionDAG &DAG);
+ bool isNextInstRet, SDOperand Callee, ArgListTy &Args,
+ SelectionDAG &DAG);
+ // IsEligibleForTailCallElimination - Check whether the call is
eligible for
+ // tailcall elimination
+ virtual bool IsEligibleForTailCallElimination(SelectionDAG& DAG,
+ bool IsNextInstRet,
+ SDOperand Callee,
+ unsigned CalleeCC)
const {
+ return false;
+ }
+

I am not too keen on "isNextInstRet". We should never use instruction
ordering before scheduling to determine whether it's possible to
perform an optimization. IsEligibleForTailCallElimination() should
determine the feasibility on its own, no?

My assumption:
Tail call optimization is performed if the instruction following the call is
a return and the return uses the result of the call or is a void return.

The problem is that at the point (in TargetLowering:LowerCallTo) where
IsEligibleForTailCallElimination is called the SDOperand node for the
following instructions (we are looking for a return) has not been build.
So IsEligibleForTailCallElimination can't determine - using 'Callee' -
whether the instruction following the call is a return. That is the reason
why i pass the IsNextInstRet flag to TargetLowering::LowerCallTo.

TargetLowering::LowerCallTo(SDOperand Chain, const Type *RetTy,
                             bool RetTyIsSigned, bool isVarArg,
                             unsigned CallingConv, bool isTailCall,
                             bool isNextInstRet, SDOperand Callee,
                             ArgListTy &Args, SelectionDAG &DAG) {
   SmallVector<SDOperand, 32> Ops;
   Ops.push_back(Chain); // Op#0 - Chain
   Ops.push_back(DAG.getConstant(CallingConv, getPointerTy())); // Op#1 - CC
   Ops.push_back(DAG.getConstant(isVarArg, getPointerTy())); // Op#2 - VarArg
   if (isTailCall &&
       IsEligibleForTailCallElimination(DAG, isNextInstRet, Callee, CallingConv))
     Ops.push_back(DAG.getConstant(true, getPointerTy())); // Op#3 - Tail

The instructions i am refering here to are LLVM IR instructions not
target machine instructions. I could remove isNextInstRet from
IsEligibleForTailCallElimination because in TargetLowering::LowerCallTo
it is clear that we can't do tail call optimization if isNextInstRet (this time coming from
SelectionDAGLowering::LowerCallTo) is false.

if (isTailCall && isNextInstRet &&
       IsEligibleForTailCallElimination(DAG, Callee, CallingConv))

Note that I interpreted the following paragraph of you:

IsEligibleForTailCallElimination() should be a target hook. This way
TargetLowering::LowerCallTo() can determine whether a call is
eligible for tail call elimination and insert the current ISD::CALL
node. X86TargetLowering::LowerX86_32FastCCCallTo() will not have to
handle non-tail calls.

in the following way:
I check in TargetLowering:LowerCallTo whether the call IsEligibleForTailCallElimination
and insert a ISD:CALL node with the tailcall attribute set or not.

Then in X86TargetLowering::LowerCALL i can dispatch the right lowering code:

SDOperand X86TargetLowering::LowerCALL(SDOperand Op, SelectionDAG &DAG) {
  unsigned CallingConv = cast<ConstantSDNode>(Op.getOperand(1))->getValue();
  bool isTailCall = cast<ConstantSDNode>(Op.getOperand(3))->getValue() != 0;

  case CallingConv::Fast:
      if (isTailCall && PerformTailCallOpt)
        return LowerX86_TailCallTo(Op, DAG, CallingConv);
      else
        return LowerCCCCallTo(Op,DAG, CallingConv);

Some stylistic nitpicks. Please write the comments as:
/// IsNextInstructionReturn - Check whether..

Will do.

+ assert(BI != BB->end() && "Woohooa");
Better assertion messages please. :slight_smile:

Why not write it like this:

okay

Also, shouldn't this function be "static"?

okay

Please fix the inconsistency: "TAILCALL" vs. "TAIL CALL".

okay

+SDOperand GetPossiblePreceedingTailCall(SDOperand Chain) {

This should be "static"?

okay

+
+ Chain = DAG.getNode( X86ISD::CALL,
                        NodeTys, &Ops[0], Ops.size());

Stylistic nitpick: please merge 2 lines and remove the space after '('.

okay

- Chain = DAG.getNode(X86ISD::FP_SET_RESULT, Tys, Ops, 2);
- Flag = Chain.getValue(1);
+ Chain = DAG.getNode(X86ISD::FP_SET_RESULT, Tys, Ops, 2);
+ Flag = Chain.getValue(1);

What happened here?

Ups corrected.

+// Check to see whether the next instruction following the call is a
return
+// A function is eligible if caller/callee calling conventions
match, currently
+// only fastcc supports tail calls, and the function CALL is
immediatly followed
+// by a RET
+bool X86TargetLowering::IsEligibleForTailCallElimination
(SelectionDAG& DAG,
+ bool
IsNextInstRet,
+ SDOperand
Callee,
+ unsigned
CalleeCC)
+ const {

Having "const {" on a separate line looks funky. :slight_smile:

will change :wink:

Please remove code that's commented out.

if( G ==> if (G :slight_smile:

No need to set IsEligible to false since it's initialized to false.

:wink: okay dokey

regards

+; RUN: llvm-as < %s | llc -march=x86 -mattr=+sse2 -stats -info-
output-file - | grep asm-printer | grep 9
+; change preceeding line form ... | grep 8 to ..| grep 9 since
+; with new fastcc has std call semantics causing a stack adjustment
+; after the function call

Not sure if I understand this. Can you illustrate with an example?

Sure

The code generated used to be
_array:
        subl $12, %esp
        movss LCPI1_0, %xmm0
        mulss 16(%esp), %xmm0
        movss %xmm0, (%esp)
        call L_qux$stub
        mulss LCPI1_0, %xmm0
        addl $12, %esp
        ret

FastCC use to be caller pops arguments so there was no stack
adjustment after the
call to qux. Now FastCC has callee pops arguments on return semantics
so the
x86 backend inserts a stack adjustment after the call.

_array:
   subl $12, %esp
        movss LCPI1_0, %xmm0
        mulss 16(%esp), %xmm0
        movss %xmm0, (%esp)
        call L_qux$stub
        subl $4, %esp << stack adjustment because qux pops
arguments on return
        mulss LCPI1_0, %xmm0
        addl $12, %esp
        ret $4

Ok. Please make sure for now this only takes effect if tail call optimization is turned on though. Also, we need to fold the subl into addl if possible.

Index: include/llvm/Target/TargetLowering.h

--- include/llvm/Target/TargetLowering.h (revision 42247)
+++ include/llvm/Target/TargetLowering.h (working copy)
@@ -851,8 +851,18 @@
   virtual std::pair<SDOperand, SDOperand>
   LowerCallTo(SDOperand Chain, const Type *RetTy, bool
RetTyIsSigned,
               bool isVarArg, unsigned CallingConv, bool isTailCall,
- SDOperand Callee, ArgListTy &Args, SelectionDAG &DAG);
+ bool isNextInstRet, SDOperand Callee, ArgListTy &Args,
+ SelectionDAG &DAG);
+ // IsEligibleForTailCallElimination - Check whether the call is
eligible for
+ // tailcall elimination
+ virtual bool IsEligibleForTailCallElimination(SelectionDAG& DAG,
+ bool IsNextInstRet,
+ SDOperand Callee,
+ unsigned CalleeCC)
const {
+ return false;
+ }
+

I am not too keen on "isNextInstRet". We should never use instruction
ordering before scheduling to determine whether it's possible to
perform an optimization. IsEligibleForTailCallElimination() should
determine the feasibility on its own, no?

My assumption:
Tail call optimization is performed if the instruction following the
call is
a return and the return uses the result of the call or is a void return.

Ok. But using whether the next instruction is a return at this time is too restrictive. It's possible there may be instructions that are between the call and the return that do not interfere with tail call optimization.

The problem is that at the point (in TargetLowering:LowerCallTo) where
IsEligibleForTailCallElimination is called the SDOperand node for the
following instructions (we are looking for a return) has not been build.
So IsEligibleForTailCallElimination can't determine - using 'Callee' -
whether the instruction following the call is a return. That is the
reason
why i pass the IsNextInstRet flag to TargetLowering::LowerCallTo.

Right. But I think the right way to deal with this is to assume all the tail calls are eligible for tail call optimizations at this point. Then checks for eligibility and fix the call nodes after all the instructions are lowered.

Evan

> FastCC use to be caller pops arguments so there was no stack
> adjustment after the
> call to qux. Now FastCC has callee pops arguments on return semantics
> so the
> x86 backend inserts a stack adjustment after the call.
>
> _array:
> subl $12, %esp
> movss LCPI1_0, %xmm0
> mulss 16(%esp), %xmm0
> movss %xmm0, (%esp)
> call L_qux$stub
> subl $4, %esp << stack adjustment because qux pops
> arguments on return
> mulss LCPI1_0, %xmm0
> addl $12, %esp
> ret $4

Ok. Please make sure for now this only takes effect if tail call
optimization is turned on though. Also, we need to fold the subl into
addl if possible.

I think you misunderstood me. I have not added code to insert this
subl it is done already by the backend (before my changes) with
certain calling conventions (callee pops arguments). There used to be
already calling conventions e.g. x86_fastcall that would have cause
the stack adjustment only fastcc was not one of them. Now that fastcc
can cause tail call optimization i had to change the convention from
caller pops arguments to callee pops arguments in order to allow tail
call optimization in a general way.

>
>
>> Index: include/llvm/Target/TargetLowering.h
>> ===================================================================
>> --- include/llvm/Target/TargetLowering.h (revision 42247)
>> +++ include/llvm/Target/TargetLowering.h (working copy)
>> @@ -851,8 +851,18 @@
>> virtual std::pair<SDOperand, SDOperand>
>> LowerCallTo(SDOperand Chain, const Type *RetTy, bool
>> RetTyIsSigned,
>> bool isVarArg, unsigned CallingConv, bool isTailCall,
>> - SDOperand Callee, ArgListTy &Args, SelectionDAG &DAG);
>> + bool isNextInstRet, SDOperand Callee, ArgListTy &Args,
>> + SelectionDAG &DAG);
>> + // IsEligibleForTailCallElimination - Check whether the call is
>> eligible for
>> + // tailcall elimination
>> + virtual bool IsEligibleForTailCallElimination(SelectionDAG& DAG,
>> + bool IsNextInstRet,
>> + SDOperand Callee,
>> + unsigned CalleeCC)
>> const {
>> + return false;
>> + }
>> +
>>
>> I am not too keen on "isNextInstRet". We should never use instruction
>> ordering before scheduling to determine whether it's possible to
>> perform an optimization. IsEligibleForTailCallElimination() should
>> determine the feasibility on its own, no?
>
> My assumption:
> Tail call optimization is performed if the instruction following the
> call is
> a return and the return uses the result of the call or is a void
> return.

Ok. But using whether the next instruction is a return at this time is
too restrictive. It's possible there may be instructions that are
between the call and the return that do not interfere with tail call
optimization.

>
>
> The problem is that at the point (in TargetLowering:LowerCallTo) where
> IsEligibleForTailCallElimination is called the SDOperand node for the
> following instructions (we are looking for a return) has not been
> build.
> So IsEligibleForTailCallElimination can't determine - using
> 'Callee' -
> whether the instruction following the call is a return. That is the
> reason
> why i pass the IsNextInstRet flag to TargetLowering::LowerCallTo.

Right. But I think the right way to deal with this is to assume all
the tail calls are eligible for tail call optimizations at this point.
Then checks for eligibility and fix the call nodes after all the
instructions are lowered.

I am not sure that i understand where this point would be. As I view
it the only place would be in LegalizeDAG.cpp in

SDOperand SelectionDAGLegalize::LegalizeOp(SDOperand Op) {
...
case ISD::CALL:
    // The only option for this is to custom lower it.
    Tmp3 = TLI.LowerOperation(Result.getValue(0), DAG);

}

would change to (pseudo code)
SDOperand SelectionDAGLegalize::LegalizeOp(SDOperand Op) {
SDOperand Result = Op;
...
case ISD::CALL:
    // The only option for this is to custom lower it.

    Result = modifyCallSoThatTailAttributeReallyIndicatesWhetherThisCallIsEligibleForTailCallOptimization(Result);

    Tmp3 = TLI.LowerOperation(Result.getValue(0), DAG);

}

because any point after that (e.g. scheduling) would be to late
because the target specific loweringcode (e.g. LowerCALL ->
LowerCCCall)

FastCC use to be caller pops arguments so there was no stack
adjustment after the
call to qux. Now FastCC has callee pops arguments on return semantics
so the
x86 backend inserts a stack adjustment after the call.

_array:
      subl $12, %esp
        movss LCPI1_0, %xmm0
        mulss 16(%esp), %xmm0
        movss %xmm0, (%esp)
        call L_qux$stub
        subl $4, %esp << stack adjustment because qux pops
arguments on return
        mulss LCPI1_0, %xmm0
        addl $12, %esp
        ret $4

Ok. Please make sure for now this only takes effect if tail call
optimization is turned on though. Also, we need to fold the subl into
addl if possible.

I think you misunderstood me. I have not added code to insert this
subl it is done already by the backend (before my changes) with
certain calling conventions (callee pops arguments). There used to be
already calling conventions e.g. x86_fastcall that would have cause
the stack adjustment only fastcc was not one of them. Now that fastcc
can cause tail call optimization i had to change the convention from
caller pops arguments to callee pops arguments in order to allow tail
call optimization in a general way.

Hmmm. Ok. So this is due to X86CallingConv.td changes? Unfortunately that's not controlled by options. Ok then.

Index: include/llvm/Target/TargetLowering.h

--- include/llvm/Target/TargetLowering.h (revision 42247)
+++ include/llvm/Target/TargetLowering.h (working copy)
@@ -851,8 +851,18 @@
   virtual std::pair<SDOperand, SDOperand>
   LowerCallTo(SDOperand Chain, const Type *RetTy, bool
RetTyIsSigned,
               bool isVarArg, unsigned CallingConv, bool isTailCall,
- SDOperand Callee, ArgListTy &Args, SelectionDAG &DAG);
+ bool isNextInstRet, SDOperand Callee, ArgListTy &Args,
+ SelectionDAG &DAG);
+ // IsEligibleForTailCallElimination - Check whether the call is
eligible for
+ // tailcall elimination
+ virtual bool IsEligibleForTailCallElimination(SelectionDAG& DAG,
+ bool IsNextInstRet,
+ SDOperand Callee,
+ unsigned CalleeCC)
const {
+ return false;
+ }
+

I am not too keen on "isNextInstRet". We should never use instruction
ordering before scheduling to determine whether it's possible to
perform an optimization. IsEligibleForTailCallElimination() should
determine the feasibility on its own, no?

My assumption:
Tail call optimization is performed if the instruction following the
call is
a return and the return uses the result of the call or is a void
return.

Ok. But using whether the next instruction is a return at this time is
too restrictive. It's possible there may be instructions that are
between the call and the return that do not interfere with tail call
optimization.

The problem is that at the point (in TargetLowering:LowerCallTo) where
IsEligibleForTailCallElimination is called the SDOperand node for the
following instructions (we are looking for a return) has not been
build.
So IsEligibleForTailCallElimination can't determine - using
'Callee' -
whether the instruction following the call is a return. That is the
reason
why i pass the IsNextInstRet flag to TargetLowering::LowerCallTo.

Right. But I think the right way to deal with this is to assume all
the tail calls are eligible for tail call optimizations at this point.
Then checks for eligibility and fix the call nodes after all the
instructions are lowered.

I am not sure that i understand where this point would be. As I view
it the only place would be in LegalizeDAG.cpp in

No, in SelectionDAGISel.cpp:BuildSelectionDAG(). After line 4610, the complete DAG is available. You can add a call here to look at the last call and the terminator to decide whether the call is really a candidate for tail call optimization. You can fix it up at this point.

Evan

Sure it can be, you can set up custom predicates, for example the X86CallingConv.td file has:

class CCIfSubtarget<string F, CCAction A>
  : CCIf<!strconcat("State.getTarget().getSubtarget<X86Subtarget>().", F), A>;

It would be straight-forward to have a CCIf defined to check some command line argument. I think enabling this as llcbeta for a few nights makes sense before turning it on by default.

-Chris

Hi all,

I changed the code that checks whether a tail call is really eligible for optimization so that it performs the check/fix in SelectionDAGISel.cpp:BuildSelectionDAG() as suggest by Evan. Also eliminated an error that caused the remaining failing test cases in the test-suite.

The results look very nice (on darwin x86, r42486).
The same number (46) of failing test cases on patched version and vanilla version. LCCBETA was enabled on both. I changed the LCCBETA option in Makefile.programs in the patched version to include the tail-call-opt flags so that the optimization code gets exercised.

ifeq ($(ARCH),x86)
LLCBETAOPTION := -regalloc=local -fast -tail-call-opt -tail-call-opt-align-stack

I think enabling this as llcbeta for a few nights makes
sense before turning it on by default.

-Chris

What does turning on by default mean? Does that mean it is shown in llc -help or that it is performed every time (without requesting via the command line) when compiling?

I would not do the latter since tail call optimization clears the stack frame of the caller (sequence), possibly causing confusion when debugging also resulting code is probably not faster (since arguments are lowered to regular stack slot -where they would be if the call was a normal call - first and then moved to the real stack slot - onto the callers parameters. As noted in the source file this might be optimized in many cases by looking whether the actual parameters come from the caller function parameters and would be overwritten if they are not moved. In cases where they would not be overwritten they could be directly lowered to the real stack slot).

regards arnold

tailcall-r42525-src.patch (59.1 KB)

Hi all,

I changed the code that checks whether a tail call is really eligible for optimization so that it performs the check/fix in SelectionDAGISel.cpp:BuildSelectionDAG() as suggest by Evan. Also eliminated an error that caused the remaining failing test cases in the test-suite.

The results look very nice (on darwin x86, r42486).
The same number (46) of failing test cases on patched version and vanilla version. LCCBETA was enabled on both. I changed the LCCBETA option in Makefile.programs in the patched version to include the tail-call-opt flags so that the optimization code gets exercised.

Ok. I'll review the patch soon. Thanks.

ifeq ($(ARCH),x86)
LLCBETAOPTION := -regalloc=local -fast -tail-call-opt -tail-call-opt-align-stack

Please remove -regalloc=local -fast. We want to test this patch separately. Can you explain the advantages / disadvantages of -tail-call-opt-align-stack?

I think enabling this as llcbeta for a few nights makes
sense before turning it on by default.

-Chris

What does turning on by default mean? Does that mean it is shown in llc -help or that it is performed every time (without requesting via the command line) when compiling?

The later.

I would not do the latter since tail call optimization clears the stack frame of the caller (sequence), possibly causing confusion when debugging also resulting code is probably not faster (since arguments are lowered to regular stack slot -where they would be if the call was

Running it as llcbeta will hopefully help us identify the performance issues so we can fix them before it's turned on by default.

a normal call - first and then moved to the real stack slot - onto the callers parameters. As noted in the source file this might be optimized in many cases by looking whether the actual parameters come from the caller function parameters and would be overwritten if they are not moved. In cases where they would not be overwritten they could be directly lowered to the real stack slot).

Please add the potential optimizations (with concrete examples / assembly snippets) to README.txt under Target/X86. Thanks!

Evan

> ifeq ($(ARCH),x86)
> LLCBETAOPTION := -regalloc=local -fast -tail-call-opt -tail-call-opt-
> align-stack

okay i ll do another round of testing with
LLCBETAOPTION := -tail-call-opt -tail-call-opt-align-stack

Please remove -regalloc=local -fast. We want to test this patch
separately. Can you explain the advantages / disadvantages of -tail-
call-opt-align-stack?

When you do tail call optimization just before calling/jmping to the
callee you sometimes have to adjust the stack pointer.
e.g

int calller(int arg1, int arg2)
{
  return callee(arg2, arg1, 2*arg3;
}

conceptually before jumping to callee the stackpointer has to be
adjusted (by -4 bytes in the example). Now this can cause the stack to
be misaligned (according to the target abi. In order to prevent this
when calculating the argument size (and when tail-call-opt-align-stack
is enabled) i make the resulting argument size a multiple of the
target alignment (minus the return address slot).
So in the example above the argument stack slot size would be 12bytes
for both functions.(on darwin-x86 which requires the stack to be
16byte aligned - which i found out by mistake - because dynamically
linked function calls would not work :wink:
this results in a stack adjustment that is a multiple of the target
stack alignment.

Maybe the default should be to perform this stack alignment and turn
it off with a switch (tail-call-opt-disable-stack-alignment) when
required? which one would do if he knew that he never calls to
dynamically linked functions on darwin for example or only from a top
level function (one that is not called from others eg. main).
why didn't i made the switch tail-call-opt-disable-stack-alignment in
the first place? hmm stupidity comes to ones mind :wink: no i had the llvm
as a backend for functional languages in mind which might not require
the stack to be aligned and only call printf form a top level
function.

in short:
the disadvantage of -tail-call-opt-align-stack is that the function
stack frame is bigger.
the advantage is that it won't blow up if for example the darwin
linker requires functions to be 16 byte aligned :slight_smile:

> What does turning on by default mean? Does that mean it is shown in
> llc -help or that it is performed every time (without requesting via
> the command line) when compiling?

The later.

>
>
> I would not do the latter since tail call optimization clears the
> stack frame of the caller (sequence), possibly causing confusion
> when debugging also resulting code is probably not faster (since
> arguments are lowered to regular stack slot -where they would be if
> the call was

Running it as llcbeta will hopefully help us identify the performance
issues so we can fix them before it's turned on by default.

Okay.

Please add the potential optimizations (with concrete examples /
assembly snippets) to README.txt under Target/X86. Thanks!

will do.

regards arnold

ifeq ($(ARCH),x86)
LLCBETAOPTION := -regalloc=local -fast -tail-call-opt -tail-call-opt-
align-stack

Please remove -regalloc=local -fast. We want to test this patch
separately.

just did a test with
LLCBETAOPTION := -tail-call-opt -tail-call-opt-align-stack

this time only SPASS llc-beta fails (comparing with vanilla version)
also jit fails (exit 139) but in both versions (vanilla/patched) which it did not do when i last tested

but looking very briefly at the results i am not sure llc-beta really fails since the programm (SPASS) runs through and only the diff fails

as you can see below the tail call optimized version needs more clauses to do the prove but eventually end up with the same result:

tail-call-opt:
  ....
         Given clause: 29875[3:Res:23.2,29765.0] || member(U,singleton(universal_class))* -> .
SPASS V 2.1
SPASS beiseite: Proof found.
Problem: /Users/arnold/Desktop/testing/tc-test/llvm/projects/llvm-test/MultiSource/Applications/SPASS/problem.dfg
SPASS derived 39592 clauses, backtracked 10752 clauses and kept 23124 clauses.
SPASS allocated 3905 KBytes.
SPASS spent No Timing on this machine. on the problem.
                  No Timing on this machine. for the input.
                  No Timing on this machine. for the FLOTTER CNF translation.
                  No Timing on this machine. for inferences.
                  No Timing on this machine. for the backtracking.
                  No Timing on this machine. for the reduction.

--------------------------SPASS-STOP------------------------------
exit 0

vanilla:
  ....
         Given clause: 29876[3:Res:23.2,29766.0] || member(U,singleton(universal_class))* -> .
SPASS V 2.1
SPASS beiseite: Proof found.
Problem: /Users/arnold/Desktop/testing/tc-test/llvm/projects/llvm-test/MultiSource/Applications/SPASS/problem.dfg
SPASS derived 39572 clauses, backtracked 10747 clauses and kept 23119 clauses.
SPASS allocated 3885 KBytes.
SPASS spent No Timing on this machine. on the problem.
                  No Timing on this machine. for the input.
                  No Timing on this machine. for the FLOTTER CNF translation.
                  No Timing on this machine. for inferences.
                  No Timing on this machine. for the backtracking.
                  No Timing on this machine. for the reduction.

--------------------------SPASS-STOP------------------------------
exit 0

an arbitrary example taken from
> diff SPASS.out-llc SPASS.out-llc-beta
3870,3872c3870,3872
< Given clause: 29104[0:Res:357.1,166.0] || equal(universal_class,singleton_relation) -> member(y,element_relation)*.
< Given clause: 29115[0:Res:361.1,166.0] || equal(universal_class,singleton_relation) -> member(universal_class,element_relation)*.
< Given clause: 29213[3:SpR:29211.0,5371.0] || -> subclass(universal_class,complement(intersection(singleton(universal_class),universal_class)))*.

in short:
the disadvantage of -tail-call-opt-align-stack is that the function
stack frame is bigger.
the advantage is that it won't blow up if for example the darwin
linker requires functions to be 16 byte aligned :slight_smile:

Thank you for the explanation. I'd much rather not have this separate option. IMHO we should just make this be controlled by -tail-call-opt for now.

Evan

ifeq ($(ARCH),x86)
LLCBETAOPTION := -regalloc=local -fast -tail-call-opt -tail-call-opt-
align-stack

Please remove -regalloc=local -fast. We want to test this patch
separately.

just did a test with
LLCBETAOPTION := -tail-call-opt -tail-call-opt-align-stack

this time only SPASS llc-beta fails (comparing with vanilla version)
also jit fails (exit 139) but in both versions (vanilla/patched)
which it did not do when i last tested

You mean SPASS JIT fails without tail call optimization being turned on? Is this due to fastcc abi change (callee pops arguments)?

but looking very briefly at the results i am not sure llc-beta really
fails since the programm (SPASS) runs through and only the diff fails

If the output mismatches then it has failed. This will not stop the patch from being accepted but it will have to be fixed eventually.

Evan

Comments:

CheckDAGForTailCallsAndFixThem -
1.
   for (SelectionDAG::allnodes_iterator BE = DAG.allnodes_begin(),
+ BI = prior(DAG.allnodes_end()); BI != BE; BI--) {
Please use pre-decrement instead of post-decrement.

2. The function is slower than it should be. You are scanning all the nodes in the DAG twice. You should just examine DAG.getRoot() to make determine whether it's a return or not.

@@ -4618,6 +4662,12 @@
    // Lower the terminator after the copies are emitted.
    SDL.visit(*LLVMBB->getTerminator());

+ // check whether calls in this block are real tail calls fix up CALL nodes
+ // with correct tailcall attribute so that the target can rely on the tailcall
+ // attribute indicating whether the call is really eligible for tail call
+ // optimization
+ CheckDAGForTailCallsAndFixThem(DAG, TLI);

Hi Evan,
I incoporated the changes you request but to the following i have got a question:

Also, moving the option
there will allow us to change fastcc ABI (callee popping arguments)
only when this option is on. See Chris' email:

I am not to sure on that. because that would make modules compiled with the flag on incompatible with ones compiled without the flag off as stack behaviour would mismatch.
It would be no problem to make the behaviour dependent on the -tail-call-opt flag. i am not sure that this is a good idea?

pseudocode:

module 1: with -tailcallopt enabled
fastcc int callee(int arg1, int arg2) {
...
-> onreturn: pops 8 byte
}

module 2: no -tailcallopt
fastcc int caller() {
  int x= call fastcc callee();
  //!! caller pops the arguments => stack mismatch

callee pops the arguments but caller also wants to pop the arguments of the stack

Apparently i forgot to send the answer email to chris reponse. sorry for that.

Hmmm. Ok. So this is due to X86CallingConv.td changes? Unfortunately
that's not controlled by options. Ok then.

Sure it can be, you can set up custom predicates, for example the
X86CallingConv.td file has:

class CCIfSubtarget<string F, CCAction A>
  : CCIf<!strconcat("State.getTarget().getSubtarget<X86Subtarget>().", F), A>;

It would be straight-forward to have a CCIf defined to check some command
line argument. I think enabling this as llcbeta for a few nights makes
sense before turning it on by default.

No not directly. The code related to "caller/callee cleans arguments off the stack" is not controlled by the .td. It's controlled in code by the operands of CALLSEQ_END.

for example in SDOperand X86TargetLowering::LowerCCCCallTo:
...
   if (CC == CallingConv::X86_StdCall || CC == CallingConv::Fast) {
     if (isVarArg)
       NumBytesForCalleeToPush = isSRet ? 4 : 0;
     else
       NumBytesForCalleeToPush = NumBytes;
     assert(!(isVarArg && CC==CallingConv::Fast) &&
             "CallingConv::Fast does not support varargs.");
   } else {
     // If this is is a call to a struct-return function, the callee
     // pops the hidden struct pointer, so we have to push it back.
     // This is common for Darwin/X86, Linux & Mingw32 targets.
     NumBytesForCalleeToPush = isSRet ? 4 : 0;
   }

   NodeTys = DAG.getVTList(MVT::Other, MVT::Flag);
   Ops.clear();
   Ops.push_back(Chain);
   Ops.push_back(DAG.getConstant(NumBytes, getPointerTy()));
   Ops.push_back(DAG.getConstant(NumBytesForCalleeToPush, getPointerTy()));
   Ops.push_back(InFlag);
   Chain = DAG.getNode(ISD::CALLSEQ_END, NodeTys, &Ops[0], Ops.size());
   InFlag = Chain.getValue(1);

   The third operand is the number of bytes the callee pops of the stack on return (on x86). This gets lowered to a ADJCALLSTACKUP pseudo machineinstruction.

Later when X86RegisterInfo::eliminateCallFramePseudoInstr is called and framepointerelimination is enabled the following code gets called:
...
   else if (I->getOpcode() == X86::ADJCALLSTACKUP) {
     // If we are performing frame pointer elimination and if the callee pops
     // something off the stack pointer, add it back. We do this until we have
     // more advanced stack pointer tracking ability.
     if (uint64_t CalleeAmt = I->getOperand(1).getImm()) {
       unsigned Opc = (CalleeAmt < 128) ?
         (Is64Bit ? X86::SUB64ri8 : X86::SUB32ri8) :
         (Is64Bit ? X86::SUB64ri32 : X86::SUB32ri);
       MachineInstr *New =
         BuildMI(TII.get(Opc), StackPtr).addReg(StackPtr).addImm(CalleeAmt);
       MBB.insert(I, New);
     }
   }

I am not sure about a command line switch would toggling the stack adjusting behaviour of a function. Because if the function is called from a different module which was compiled with the opposite command line switch all hell would break loose (because it assumes callee pops arguments when it does not).

just did a test with
LLCBETAOPTION := -tail-call-opt -tail-call-opt-align-stack

this time only SPASS llc-beta fails (comparing with vanilla version)
also jit fails (exit 139) but in both versions (vanilla/patched)
which it did not do when i last tested

You mean SPASS JIT fails without tail call optimization being turned
on? Is this due to fastcc abi change (callee pops arguments)?

No i always test with 2 trees. One with my patches applied and a clean checkout from
svn with no patches to have a comparison. With r42608 also the clean version
showed a failure with SPASS JIT. Well not really clean since it is an evolution of
svn checkout and svn update but again none of my patches applied. so it might just
be something gone wrong there.

my patched version
SPASS llc-beta failed
SPASS jit failed

(semi-)clean version from svn
SPASS jit failed

but looking very briefly at the results i am not sure llc-beta really
fails since the programm (SPASS) runs through and only the diff fails

If the output mismatches then it has failed. This will not stop the
patch from being accepted but it will have to be fixed eventually.

will be investigating further of course

regards arnold

Hi Evan,
I incoporated the changes you request but to the following i have got
a question:

Also, moving the option
there will allow us to change fastcc ABI (callee popping arguments)
only when this option is on. See Chris' email:

I am not to sure on that. because that would make modules compiled
with the flag on incompatible with ones compiled without the flag off
as stack behaviour would mismatch.
It would be no problem to make the behaviour dependent on the -tail-
call-opt flag. i am not sure that this is a good idea?

In theory, any function can be marked fastcc. But llvm-gcc c / c++ frontend does *not* mark any external functions fastcc. Also, the optimizer only changes the calling convention of internal functions to fastcc. We can set a policy of treating fastcc external functions as c functions. Then there is no chance of introducing ABI incompatibility.

pseudocode:

module 1: with -tailcallopt enabled
fastcc int callee(int arg1, int arg2) {
...
-> onreturn: pops 8 byte
}

module 2: no -tailcallopt
fastcc int caller() {
  int x= call fastcc callee();
  //!! caller pops the arguments => stack mismatch

callee pops the arguments but caller also wants to pop the arguments
of the stack

Apparently i forgot to send the answer email to chris reponse. sorry
for that.

Hmmm. Ok. So this is due to X86CallingConv.td changes? Unfortunately
that's not controlled by options. Ok then.

Sure it can be, you can set up custom predicates, for example the
X86CallingConv.td file has:

class CCIfSubtarget<string F, CCAction A>
  : CCIf<!strconcat("State.getTarget().getSubtarget<X86Subtarget>
().", F), A>;

It would be straight-forward to have a CCIf defined to check some
command
line argument. I think enabling this as llcbeta for a few nights
makes
sense before turning it on by default.

No not directly. The code related to "caller/callee cleans arguments
off the stack" is not controlled by the .td. It's controlled in code
by the operands of CALLSEQ_END.

Ok, that's even easier to have it be controlled by the command line option.

Evan

Hi Evan,
I incoporated the changes you request but to the following i have got
a question:

Also, moving the option
there will allow us to change fastcc ABI (callee popping arguments)
only when this option is on. See Chris' email:

I am not to sure on that. because that would make modules compiled
with the flag on incompatible with ones compiled without the flag off
as stack behaviour would mismatch.
It would be no problem to make the behaviour dependent on the -tail-
call-opt flag. i am not sure that this is a good idea?

In theory, any function can be marked fastcc. But llvm-gcc c / c++
frontend does *not* mark any external functions fastcc. Also, the
optimizer only changes the calling convention of internal functions
to fastcc.

well i hope llvm-gcc won't stay the only front-end (hoping to see some
functional languages like ocaml :wink:

We can set a policy of treating fastcc external functions
as c functions. Then there is no chance of introducing ABI
incompatibility.

this means limiting tail call opt to protected/invisible functions within a module?
a little limiting i think.

but i think if it is documented that there will be a incompatibility between
modules compiled with tailcallopt on/off that is okay?
and this would only happen if someone other than llvm-gcc used llvm as a backend
anyway.

so i will make tailcallopt toggle the stack adjusting behaviour. with llvm-gcc this won't be a problem
and if someone other using llvm as a backend runs in to a problem than there is the documentation
and the maillist that will help.

is that okay?

regards arnold

Hi Evan,
I incoporated the changes you request but to the following i have got
a question:

Also, moving the option
there will allow us to change fastcc ABI (callee popping arguments)
only when this option is on. See Chris' email:

I am not to sure on that. because that would make modules compiled
with the flag on incompatible with ones compiled without the flag off
as stack behaviour would mismatch.
It would be no problem to make the behaviour dependent on the -tail-
call-opt flag. i am not sure that this is a good idea?

In theory, any function can be marked fastcc. But llvm-gcc c / c++
frontend does *not* mark any external functions fastcc.

The user can specify the "fastcall" and "stdcall" attributes, and it looks
to me like llvm-gcc honors that. Certainly it should.

This would imply one fastcc abi (callee pops args on return) to rule them all?
That is only if fastcall translates to llvm fastcc of course.

regards