RFC: ARM Strategy for adding tail call optimization to thumb1 targets

Hello,

after a while of code-reading and analysis I have come up with a plan on how to add tail call optimizations to thumb1 targets. As a beginner within LLVM, I would very much appreciate feedback and advice before proceeding further.

I identified two main complexities rendering tail call optimization more difficult for thumb1 targets (this is also already stated in some very old comment within ARMTargetLowering::IsEligibleForTailCallOptimization that does not seem to mach current state of the code at some minor points):

1.) Unlike the BL instruction the B instruction only reaches targets in up to ~2k byte distance. This means that for true tail calls the jump target address needs to be placed in a register.
2.) We need a scratch register to restore LR contents in the epilogue if LR has previously been pushed on the stack.

My plan is to implement the following:

1.) When lowering tail calls during DAG generation, arrange for the call target adress always being loaded into a register. So that actually only ARM::TCRETURNri pseudo opcodes will be generated for tail calls. (Penalty of this is will be one additional mov r12,r3 in case of exactly 3 parameters being passed on the stack for constant addresses, see below).

2.) In case that r0 up to r3 are used for parameter passing. I noticed, that the thumb1 register allocator settings will not be able to allocate a suitable register. My plan is to somehow force the tail call jump target into hard reg r12 in this case before register allocation. If at least r3 is free, the register allocator will be able to find a register for the TCRETURNri pseudo instruction and we don't have a problem until epilogue generation.

QUESTION 1: How would be the "right" way to force the target address into the hard reg r12 in case that r0 up to r3 are used for params? Should this check and the "force to hard reg r12" code go to ARMTargetLowering::LowerCall or should this be done within the "legalize DAG" pass that seems to exist somewhere. Is there an example on how to force a value into a hard reg?

3.) After Steps 1.) and 2.), Thumb1FrameLowering::emitEpilogue will find TCRETURNri pseudo instructions for tail calls. We could detect this easily and probably handle this case within a new private member Thumb1FrameLowering::emitTailCallEpilogue ().

Thumb1FrameLowering::emitTailCallEpilogue () will have to arrange for
- updating the stack pointer,
- restoring GP registers pushed on the stack,
- restoring LR and
- jumping to the target address.

The problem is that restoring LR requires a scratch register because pop {lr} is not available. In any way, the target address will be available in a register.

Within Thumb1FrameLowering::emitTailCallEpilogue (), I plan to distinguish only 3 cases:

a) target address is in r12. This means, that we would have to generate *very* complicated code for restoring LR *before* poping the callee save-registers r4 ... r7, poping r4...r7, re-adjusting SP for the unused stack slot used for LR and then using BX r12. Instead of this very complicated code, I'd rather pop all callee-save-registers r4...r7 except LR and use a BLX r12; pop {pc} sequence instead. I.e. the stack frame would grow by 4 bytes for the LR not yet popped. Note that this would prevent tail call optimizations as soon as one parameter is passed on the stack. (see side-note below on this issue).

b) target address of TCRETURNri is in r3. I would use "mov r12,r3" for freeing r3 as scratch register, pop all registers except LR and issue a pop {r3}; mov LR,r3; BX r12 sequence.

c) target address of TCRETURNri is not in r3. Assert that r3 is not used for parameters. Use r3 as scratch register as in case b) and proceed.

This code would be slightly sub-optimal in the case b) handling exactly 3 parameters passed in registers if the target address is constant. In this case, loading the target address into r3 could have been done *after* restoring LR contents. However, the penalty in this case would be only the mov r12,r3 instruction and the complexity of handling constant and variable adresses differently and loading the operands to registers would not go into Thumb1FrameLowering::emitTailCallEpilogue ().

QUESTION 2: Is this a good idea or should I rather choose the *very* complicated code version or should the case a) be split into aa) for no parameters passed on the stack (using BLX r12; pop {pc} ) and ab) using the *very* complicated code version

QUESTION 3:

I also would appreciate suggestions on the development enviroment to use. I so far build LLC from the command shell and use Eclipse for debugging and stepping through the code using the local gdb. I'm a bit unhappy of my environment because doing so is extremely slow on my machine. The Eclipse indexer is incredibly slow and frequently references are not correctly resolved in the editor :-(. Also spawning gdb with the llc executable from eclipse takes about 2 minutes for each run ... Waiting for a full build being finished (with clang and the other stuff) takes half a day on my machine. When running make opt and llc seem to be built first by make, so I'm currently canceling the build. But also when intterrupting make after linking llc each build of llc takes about 5 minutes on my machine ... Is there a way for speeding up builds for development? Is there an environment recommended for working on LLVM? (I did not find any hint in the documentation).

Yours,

Björn Haase

Side-note:

It is my understanding that the code

extern void peter(int a,int b, int c, int d, int e);

void
hugo (int a,int b, int c, int d, int e)
{
    peter (a,b,c,d,e);
    return;
}

should in fact be eligible for tail call optimization. However, I notice, that within ARMTargetLowering::IsEligibleForTailCallOptimization
the check for matching stack offsets for parameter #5 fails so that the function is marked not to be eligible for tail call optimizations.

The relevant code portion within ARMTargetLowering::IsEligibleForTailCallOptimization is:

     } else if (!VA.isRegLoc()) {
           if (!MatchingStackOffset(Arg, VA.getLocMemOffset(), Flags,
                                    MFI, MRI, TII))
             return false;
         }

MatchingStackOffset returns false for the 5th parameter. In my eyes this seems to be a bug and not intentional.? Should I file a bug ?