RFC: Tail call optimization X86

Hi Evan,

first off thanks to you and Chris for taking time.

We'd like to see tail call optimization to be similar to the target
independent lowering of ISD::CALL nodes. These are auto-generated
from ???CallingConv.td files. Some target specific details such as
function address register (ECX in your example) should be coded in
these files. Ditto for stack offsets, etc. These should be part of the
fastcc calling convention. The tail call lowering will use these
information and target hooks to perform tail call optimization on
fastcc calls that are really tail calls.

I think it's reasonable to start by doing a (clean) target specific
implementation for X86. Just understand we'd like to move to a target
independent mechanism so the earlier implementation may not survive.

Good that roughly sounds like what i did.
(except probably for the clean bit :slight_smile:

As for what you described here. I am having a hard time following
it. :slight_smile: Please send patch.

Okay here is a patch containing the tail call optimization code
for X86. It mostly does what i have described in preceeding
emails:

There is new code for lowering fastcc calls:
LowerX86_32FastCCCallTo
LowerX86_32FastCCArguments

There is some code checking whether a TAIL CALL really is
eligible for tail call optimization.

I modified:
LowerRET
to create a TC_RETURN node. (used for adjusting stackpointer and
jumping to function in epilogue, similar to EH_RETURN)

There is a new calling convention:
CC_X86_32_TailCall
which is ~ CC_X86_32_C minus ECX.

There are some modifications in
X86InstrInfo.td

There are new X86 DAG nodes:
TRUETAILCALL
TC_RETURN

There is some voodoo in X86RegisterInfo.cpp to deal with the
RETADDR problem described in X86ISelLowering.cpp comments.
(callee has more args than caller -> there needs to be space
between RETADDR stack slot and ebp/spilled registers for safely
moving the RETADDR to)

any comments, critique, suggestions welcome.
regards arnold

tailcall10.patch (43 KB)

Hi Arnold,

Thanks for the patch. Some questions and commons:

1. Have you test it against the llvm test suite? Does it work if fp elimination optimization is turned off?
2. Please follow llvm coding convention and make sure every line fits in 80 columns.
3.
enum NameDecorationStyle {
    None,
    StdCall,
- FastCall
+ FastCall,
+ FastCC // the normal fastcc calling convention
};

Why is FastCC necessary? Can't you just use FastCall?

4.
def X86tailcall: SDNode<"X86ISD::TAILCALL", SDT_X86Call,
                          [SDNPHasChain, SDNPOutFlag, SDNPOptInFlag]>;
+def X86truetailcall: SDNode<"X86ISD::TRUETAILCALL", SDT_X86Call,
+ [SDNPHasChain, SDNPOutFlag, SDNPOptInFlag]>;

From: Evan Cheng <evan.cheng@apple.com>
Date: 11 September 2007 19:26:39 GMT+02:00
To: LLVM Developers Mailing List <llvmdev@cs.uiuc.edu>
Subject: Re: [LLVMdev] RFC: Tail call optimization X86
Reply-To: LLVM Developers Mailing List <llvmdev@cs.uiuc.edu>

Hi Arnold,

Thanks for the patch. Some questions and commons:

1. Have you test it against the llvm test suite?

No not yet.

Does it work if fp
elimination optimization is turned off?

For my test cases - yes.

2. Please follow llvm coding convention and make sure every line fits
in 80 columns.

Okay dokey. i was looking through the files manually - must have missed some - sorry.

3.
enum NameDecorationStyle {
    None,
    StdCall,
- FastCall
+ FastCall,
+ FastCC // the normal fastcc calling convention
};

Why is FastCC necessary? Can't you just use FastCall?

FastCall is used to indicate a function has x86_FastCall semantics if i am not mistaken.
I needed to differentiate between a normal fastcc and the x86_fastcall semantics in an older version
of my code. I no longer depend on that so it can be removed as you suggest. sorry for the code corpse :slight_smile:

4.
def X86tailcall: SDNode<"X86ISD::TAILCALL", SDT_X86Call,
                          [SDNPHasChain, SDNPOutFlag, SDNPOptInFlag]>;
+def X86truetailcall: SDNode<"X86ISD::TRUETAILCALL", SDT_X86Call,
+ [SDNPHasChain, SDNPOutFlag, SDNPOptInFlag]>;
+

Please use X86tailcall. It's not currently used so feel free to
change its patterns, etc.

Okay dokey.

5.
+// the following two instructions are used to adjust the stack pointer
+// in the case where the callee has more arguments than the caller
+// an area is created where the return addr can be safely moved to
+let isConvertibleToThreeAddress = 1 in { // Can transform into LEA.
+def TCADD32ri : Ii32<0x81, MRM0r, (outs GR32:$dst), (ins GR32:
$src1, i32imm:$src2),
+ "add{l}\t{$src2, $dst|$dst, $src2}",
+ []>;
+}
+
+def TCSUB32ri : Ii32<0x81, MRM5r, (outs GR32:$dst), (ins GR32:
$src1, i32imm:$src2),
+ "sub{l}\t{$src2, $dst|$dst, $src2}",
+ []>;
+

In an intermediate implementation (that did not work anyway) i need to differentiate
  between the normal ADD32ri/SUB32ri that were in the code and the stack
adjustments i added. I no longer need them so i will remove them. again a code
corpse sorry for that.

+//let isCall = 1, isTerminator = 1, isReturn = 1, isBarrier = 1 in
+//def TCRETURNmi : I<0, Pseudo, (outs), (ins i32mem:$dst, i32imm:
$offset),
+// "#TC_RETURN $dst $offset",
+// []>;
+

This node holds the stack adjustment delta and the jmp destination of the call.
The information is used in epilogue generation.

Why are these needed? They don't look any different from normal add
and subtraction instructions. Why not just use ADJCALLSTACKDOWN and
ADJCALLSTACKUP?

6.
+ } else if (RetOpcode == X86::TCRETURNdi) {
+ // a tailcall adjust the stack
...
+ } else if (RetOpcode == X86::TCRETURNri) {
+ MBBI = prior(MBB.end());

Seems like there are quite a bit of duplicate code between the 2
cases. Please refactor.

Okay dokey

7.
X86TargetLowering::X86TargetLowering(TargetMachine &TM)
    : TargetLowering(TM) {
+ IsLastCallTailCall = false;

This is probably not a good idea. You are assuming nodes are lowered
in certain order, that is dangerous. It's probably better to check
whether the last call is a tail call on the fly as you are processing
the return node.

Will do so.

8.
-
- SDOperand Chain = Op.getOperand(0);
- SDOperand Flag;
-
- // Copy the result values into the output registers.
- if (RVLocs.size() != 1 || !RVLocs[0].isRegLoc() ||
- RVLocs[0].getLocReg() != X86::ST0) {
- for (unsigned i = 0; i != RVLocs.size(); ++i) {
- CCValAssign &VA = RVLocs[i];
- assert(VA.isRegLoc() && "Can only return in registers!");
- Chain = DAG.getCopyToReg(Chain, VA.getLocReg(), Op.getOperand
(i*2+1),
- Flag);
- Flag = Chain.getValue(1);
- }
+ if (IsLastCallTailCall) {
+ IsLastCallTailCall = false;
+ SDOperand TailCall = GetTailCall(Op);
+ SDOperand TargetAddress = TailCall.getOperand(1);
+ SDOperand StackAdjustment = TailCall.getOperand(2);
+ assert ( ((TargetAddress.getOpcode() == ISD::Register &&
+ cast<RegisterSDNode>(TargetAddress)->getReg() ==
X86::ECX) ||
+ TargetAddress.getOpcode() == ISD::TargetExternalSymbol ||
+ TargetAddress.getOpcode() == ISD::TargetGlobalAddress) &&
+ "Expecting an global address, external symbol, or
register");
+ assert( StackAdjustment.getOpcode() == ISD::Constant &&
"Expecting a const value");
+ // TODO: should we add flag from tail call as last operand?
+ return DAG.getNode(X86ISD::TC_RETURN, MVT::Other, Op.getOperand
(0), TargetAddress, StackAdjustment);
    } else {

Since if (IsLastCallTailCall) { ... } always ends with an early
exist, you can just place the block before if (RVLocs.size() ...) and
there is no need to change indentation for the rest of the function.

Okay

9.
+// Check to see whether the next instruction following the call is a
return
+// A function is eligable if caller/callee calling conventions match
and the
+// function CALL is immediatly followed by a RET
+bool X86TargetLowering::IsEligibleForTailCallElimination(SDOperand
Call, SelectionDAG& DAG, unsigned CalleeCC, SDOperand Callee) {
+ bool IsEligible = false;
+ SDNode * CallNode = Call.Val;
...

+SDOperand X86TargetLowering::LowerX86_32FastCCCallTo(SDOperand Op,
+ SelectionDAG &DAG,
+ unsigned CC) {
+ DOUT << "LowerX86_32FastCCCallTo\n";
+ SDOperand Chain = Op.getOperand(0);
+ bool isVarArg = cast<ConstantSDNode>(Op.getOperand(2))-

getValue() != 0;

+ bool isTailCall = cast<ConstantSDNode>(Op.getOperand(3))-

getValue() != 0;

+ SDOperand Callee = Op.getOperand(4);
+ //unsigned NumOps = (Op.getNumOperands() - 5) / 2;
+
+ // Analyze operands of the call, assigning locations to each operand.
+ SmallVector<CCValAssign, 16> ArgLocs;
+ CCState CCInfo(CC, isVarArg, getTargetMachine(), ArgLocs);
+ CCInfo.AnalyzeCallOperands(Op.Val, CC_X86_32_TailCall);
+ if (isTailCall &&
+ IsEligibleForTailCallElimination(Op, DAG,CC, Callee) &&
+ PerformTailCallOpt) {

IsEligibleForTailCallElimination() should be a target hook. This way
TargetLowering::LowerCallTo() can determine whether a call is
eligible for tail call elimination and insert the current ISD::CALL
node.

Okay

X86TargetLowering::LowerX86_32FastCCCallTo() will not have to
handle non-tail calls.

I will then add code to LowerCCCCallTo() to check whether to use the normal
calling convention CC_X86_32_C (3 registers free for argument passing) or
the tail call (fastcc) calling convention CC_X86_32_TailCall (2 register free).

Expect to see a new patch after i have done the testing with the whole test suite.

regards arnold

From: Evan Cheng <evan.cheng@apple.com>
Date: 11 September 2007 19:26:39 GMT+02:00
To: LLVM Developers Mailing List <llvmdev@cs.uiuc.edu>
Subject: Re: [LLVMdev] RFC: Tail call optimization X86
Reply-To: LLVM Developers Mailing List <llvmdev@cs.uiuc.edu>

Hi Arnold,

Thanks for the patch. Some questions and commons:

1. Have you test it against the llvm test suite?

No not yet.

Ok, I don't expect everything would just work. But please run at least SingleSource and MultiSource and let us know what breaks. Once it's proven to pass a good chunk of the tests it can be checked in but turned off. This will allow us to test it as llcbeta.

Expect to see a new patch after i have done the testing with the
whole test suite.

Sounds good. Thanks.

Evan

Ok, I don't expect everything would just work. But please run at least
SingleSource and MultiSource and let us know what breaks. Once it's
proven to pass a good chunk of the tests it can be checked in but
turned off. This will allow us to test it as llcbeta.

I just ran the test suite in llvm-test with a vanilla copy of the svn
tree on darwin-x86.
To perform the tests i downloaded the test-suite as instructed on the web.
cd'ed into llvm-test. configured (--withllvmfoo etc) and make'd ;).
What worries me is that 148 tests are failing. to determine this
number i redirected the output (stdout+stderr) to a file and grep FAIL

wc -l on it.

Is the number of 148 failing tests sane or could this indicate there
is something bogus with my setup?
When i do a make in the llvm-test directory all tests are executed right?

regards arnold

Something isn't right with your setup. Make sure your llvm-gcc is update to date (and built against tot llvm). Also try reconfiguring llvm.

Evan

I just ran the test suite in llvm-test with a vanilla copy of the svn
tree on darwin-x86.
To perform the tests i downloaded the test-suite as instructed on the web.
cd'ed into llvm-test. configured (--withllvmfoo etc) and make'd ;).
What worries me is that 148 tests are failing. to determine this
number i redirected the output (stdout+stderr) to a file and grep FAIL
> wc -l on it.
Is the number of 148 failing tests sane or could this indicate there
is something bogus with my setup?
When i do a make in the llvm-test directory all tests are executed right?

My advice is to run "make TEST=nightly report". This will generate a txt file that is a simple table of the results... and show you what is failing (llc, jit, cbe, etc).

148 failures seems high to me. On my darwin x86 machine, I only am seeing 75 failures (running make TEST=nightly report), but my last complete report was from a couple weeks ago.

-Tanya