[RFC] [PATCH] add tail call optimization to thumb1-only targets

Hello,

please find enclosed a first patch that adds tail call optimization for thumb1-only targets.
I assume this list is the right place to publish patches for review?

Since this is my first proposal for LLVM, I'd very much appreciate your feedback.

What the patch is meant to do:

For tail calls identified during DAG generation, the target address is loaded into a register
by means of the constant pool. If R3 is used for argument passing, the target address is forced
into the hard register R12 in order to work around limitations of the thumb1 register allocator
with respect to the upper registers.

I decided to fetch the target address into a register via a constant pool lookup because, while
analyzing the code, I found that the mechanisms are also prepared for situations where
parameters are passed both in registers and on the stack. This would not be possible with
a BL // pop {pc} sequence in the epilogue, since that would change the stack offsets.
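To illustrate the stack-offset problem, here is a hand-written sketch (illustrative only; this is not code generated by the patch):

```
@ Hypothetical "BL // pop {pc}" tail-call epilogue (sketch only):
@
@   bl   target      @ branch while saved LR is still on the stack
@   pop  {pc}        @ return to the original caller afterwards
@
@ While 'target' runs, the caller's saved LR (and any other words not
@ yet popped) still sit on the stack, so an argument the callee
@ expects at, e.g., [sp, #0] would actually be found at a shifted
@ offset. Loading the target address from the constant pool, releasing
@ the whole frame, and ending the epilogue with 'bx <reg>' leaves SP
@ exactly where the callee expects it.
```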

During epilogue generation, the spilled registers are restored within the emitEpilogue function.
If LR happens to have been spilled to the stack by the prologue, it is restored via a scratch
register just before the other registers are restored.

So far I have tested the code by hand with a number of tests, analyzing the generated assembly.
In the lit test suite I get 4 failures, which at first analysis I attribute to the tail-call code
generation producing output that no longer matches the expectation strings.

Yours,

Björn

sampleTests.tar.gz (1.09 KB)

thumb1TailCallOpt.patch (11.4 KB)

Some comments on the general approach:

For tail calls identified during DAG generation, the target address will be loaded into a
register by use of the constant pool. If R3 is used for argument passing, the target address
is forced to hard reg R12 in order to overcome limitations of the thumb1 register allocator
with respect to the upper registers.

For the case when r3 is not used for argument passing, doing this is
definitely a good idea, but if we have to BX via r12 then we have
to first LDR into some other register, as there is no Thumb1
instruction to LDR directly into r12. With current trunk LLVM and your patch, for

  int calledfn(int,int,int,int);
  int test() {
    return calledfn(1,2,3,4);
  }

I get

test:
        push {r4, lr}
        movs r0, #1
        movs r1, #2
        movs r2, #3
        movs r3, #4
        ldr r4, .LCPI0_0
        mov r12, r4
        ldr r4, [sp, #4]
        mov lr, r4
        pop {r4}
        add sp, #4
        bx r12

r4 has been used to load the target address, which means we have to save
and restore r4, so we're no better off than not tail-calling. And
you can also see the next problem:

During epilog generation, spill register restoring will be done within the emit epilogue
function. If LR happens to be spilled on the stack by the prologue, it's restored by use
of a scratch register just before restoring the other registers.

POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
(ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.

So I think tailcalling is only worthwhile if the function does not save
lr and r3 is free to hold the target address. Also needing consideration
is what happens if callee-saved registers other than lr need to be saved,
but I haven't looked into this.

A few comments on the patch itself (I've only given it a quick look over):

+ bool IsLRPartOfCalleeSavedInfo = false;

Bjoern,

Thanks for the patch, this will be a really nice optimization to have :)

Would you mind using Phabricator for this? http://reviews.llvm.org/ It will make reviewing a bit easier on our end. It also helps if there's a lot of context in the diff, so `git diff -U999` is the way to go.

A few style nits:
     Please use 2-spaces instead of tabs for indentation.
     The new variable names seem a bit long compared to the rest of LLVM.

Please write some LIT test cases for this (including at least cases where R4 is and is not in CSI).

Correctness-wise, I'm a little concerned about forcing the call address into R12, as that might clash with the register scavenger (take that with a healthy dose of skepticism; I don't *know* that there's a problem here, rather I'm just uncertain how the two will interact).

Cheers,

Jon

Hello John,

thank you for your feedback.

Some comments on the general approach:

For Tail calls identified during DAG generation, the target address
will
be loaded into a register
by use of the constant pool. If R3 is used for argument passing, the
target address is forced
to hard reg R12 in order to overcome limitations thumb1 register
allocator with respect to
the upper registers.

For the case when r3 is not used for argument passing, doing this is
definitely a good idea, but if we have to BX via r12 then we have
to first LDR into some other register, as there is no Thumb1
instruction to LDR directly into r12. With current trunk LLVM and your patch, for

   int calledfn(int,int,int,int);
   int test() {
     return calledfn(1,2,3,4);
   }

I get

test:
         push {r4, lr}
         movs r0, #1
         movs r1, #2
         movs r2, #3
         movs r3, #4
         ldr r4, .LCPI0_0
         mov r12, r4
         ldr r4, [sp, #4]
         mov lr, r4
         pop {r4}
         add sp, #4
         bx r12

r4 has been used to load the target address, which means we have to save
and restore r4, so we're no better off than not tail-calling.

Concerning speed: yes, I agree.
For this specific, extremely short function: yes, I agree.

The benefit to expect (in my opinion) is *not* speed or code size; it's stack usage. On a typical Cortex-M0 with, say, 64k flash, 8k RAM, and functions with stack frames of, say, 128 bytes, it can easily make a difference whether the stack frame is released or not when returning from one function and calling the next. In fact, I am currently working on communication stack code that does exactly this very frequently, and unfortunately some of the tail calls occurring in the biggest stack consumers happen to use 4 parameters :-(.
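The kind of handler chain described above can be sketched as follows (hypothetical C; the stage names and sizes are illustrative, not from the actual project). Without tail-call optimization all three 128-byte frames are live at once, roughly 384 bytes of stack; with it, each frame is released before the next stage runs, so the peak stays near 128 bytes:

```c
#include <string.h>

/* Each stage owns a 128-byte scratch buffer and ends in a call in
 * tail position, i.e. a tail-call candidate. */
static int stage3(int acc) {
    char scratch[128];
    memset(scratch, 0, sizeof scratch); /* touch the 128-byte frame */
    return acc + scratch[0] + 3;        /* leaf: no further call */
}

static int stage2(int acc) {
    char scratch[128];
    memset(scratch, 0, sizeof scratch);
    return stage3(acc + 2);  /* tail call: frame can be released first */
}

int process(int acc) {
    char scratch[128];
    memset(scratch, 0, sizeof scratch);
    return stage2(acc + 1);  /* tail call */
}
```

Whether the frames actually overlap depends entirely on the compiler performing the optimization; the C source is identical either way.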

And you can also see the next problem:

During epilog generation, spill register restoring will be done within the emit epilogue
function. If LR happens to be spilled on the stack by the prologue, it's restored by use
of a scratch register just before restoring the other registers.

POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
(ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.

So I think tailcalling is only worthwhile if the function does not save
lr and r3 is free to hold the target address. Also needing consideration
is what happens if callee-saved registers other than lr need to be saved,
but I haven't looked into this.

I am aware of that, and I fully agree with you that with respect to speed and code size the code will not get better, and may get worse.

Why I still believe tail call optimization to be beneficial in many situations is my personal experience on real-world projects. The last 15 years of work on "v6m-scale" microcontroller projects in different companies taught me that the true issue is mostly not program memory or speed, but RAM and stack usage. With respect to a typical RAM/flash ratio of 8k/64k, I'd say that 1 byte of RAM spared is roughly "worth" 8 bytes of flash memory. (I know and admit that whether tail calls would actually save RAM strongly depends on whether the biggest stack frame shows up due to tail calls or inner calls.)

If I were asked, I'd strongly advocate an "optimize for RAM" policy for v6m, even at the cost of some limited compromises in code size and speed (see also my other mail from yesterday). I don't think there is a right or wrong answer to this question. It is risky to base the design choice on a single person's bias, so I'd suggest asking others for their opinion. Maybe there is also some useful benchmarking code around.

If I knew how to implement it, I'd suggest some heuristic at DAG generation that enables tail call optimization when the expected stack frame size exceeds, say, 32 bytes, or when a specific compile switch is given.
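Such a heuristic could look roughly like this (a sketch in C; the threshold constant, function name, and switch parameter are assumptions for illustration, not existing LLVM code):

```c
#include <stdbool.h>

/* Hypothetical decision function: enable tail-call lowering only when
 * releasing the frame early is expected to pay off in stack usage, or
 * when a dedicated compile switch forces it. */
enum { FrameSizeThresholdBytes = 32 };  /* the "say, 32 bytes" above */

bool shouldTryTailCall(unsigned expectedFrameSizeBytes,
                       bool forceTailCallSwitch) {
    return forceTailCallSwitch ||
           expectedFrameSizeBytes > FrameSizeThresholdBytes;
}
```

The threshold would need tuning against real code; the point is only that the decision can be made per call site from information available at DAG generation time.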

A few comments on the patch itself (I've only given it a quick look over):

+  bool IsLRPartOfCalleeSavedInfo = false;
+
+  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
     if (CSI.getReg() == ARM::LR)
-      IsV4PopReturn = true;
+      IsLRPartOfCalleeSavedInfo = true;
+  }
+
+  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
   IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();

+  if (IsTailCallReturn) {
+    MBBI = MBB.getLastNonDebugInstr();
+
+    // First restore callee saved registers. Unlike for normal returns
+    // this is *not* done in restoreCalleeSavedRegisters.
+    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
+
+    bool IsR4IncludedInCSI = false;
+    bool IsLRIncludedInCSI = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+      if (Reg == ARM::R4)
+        IsR4IncludedInCSI = true;
+      if (Reg == ARM::LR)
+        IsLRIncludedInCSI = true;
+    }

You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work
in setting IsLRIncludedInCSI.

Thank you for pointing this out; you are right. That stemmed from copy and paste: I previously had this part not in the epilogue generation but in the register restoring function, where I did need the duplication.

+ // ARMv4T and tail call returns require BX, see emitEpilogue
+ if ((STI.hasV4TOps() && !STI.hasV5TOps()))

The comment says 'tail call returns require BX', but the code doesn't do that.

Yes. In fact, I used the tail-jump pseudo-instruction, which in the end is implemented as BX, but I agree: it would better be re-phrased as "require branch address in register".

As an additional question. When doing some execution testing in the simulator, I did stumble myself over an issue with predicates in the instructions.

+  AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH))
+                     .addReg(ARM::R4, RegState::Kill));

It seems that the predicate operands need to be placed at a very specific position within the instruction's operand list in order to avoid internal compiler errors. For push/pop the predicate appears to be required as the first operand; for other instructions it appears to be required directly after the last explicit operand. Is there some documentation on where to place the predicates?

Yours,

Björn.

Forcing tail calls, even when it isn't profitable in terms of performance (or space in the case of -Os, though I can't off the top of my head think of any case where a faster tail call would also be larger), is what the -tailcallopt option is for; see

http://llvm.org/docs/CodeGenerator.html#tail-call-optimization

John

That link says “currently supported on x86/x86-64 and PowerPC”. The purpose of the current patch appears to be adding support for this flag for Thumb-1?

(with the implication that it is already supported for Thumb-2 and ARM32 and the documentation is out of date)

I could, of course, be confused.