Hello John,
thank you for your feedback.
Some comments on the general approach:
For tail calls identified during DAG generation, the target address is
loaded into a register via the constant pool. If R3 is used for argument
passing, the target address is forced into hard register R12 in order to
work around limitations of the Thumb1 register allocator with respect to
the upper registers.
For the case when r3 is not used for argument passing this is
definitely a good idea, but if we have to BX via r12 then we first
have to do an LDR into some other register, as there is no encoding
to LDR directly into r12. With current trunk LLVM and your patch, for
int calledfn(int,int,int,int);
int test() {
return calledfn(1,2,3,4);
}
I get
test:
push {r4, lr}
movs r0, #1
movs r1, #2
movs r2, #3
movs r3, #4
ldr r4, .LCPI0_0
mov r12, r4
ldr r4, [sp, #4]
mov lr, r4
pop {r4}
add sp, #4
bx r12
r4 has been used to load the target address, which means we have to save
and restore r4, so we're no better off than not tail-calling.
Concerning speed: Yes. I agree.
For this specific extremely short function: Yes. I agree.
The benefit to expect (in my opinion) is *not* speed or code size. It's stack usage. On a typical Cortex-M0 with, say, 64k flash, 8k RAM, and functions with stack frames of, say, 128 bytes, it may easily make a difference whether the stack frame is released or not when returning from a function and calling the next one. In fact, I am currently working on communication stack code that does exactly this very frequently, and unfortunately some of the tail calls occurring in the biggest stack consumers happen to use 4 parameters :-(.
And you can also see the next problem:
During epilogue generation, spilled registers are restored within the
emit-epilogue function. If LR happens to have been spilled to the stack
by the prologue, it is restored via a scratch register just before the
other registers are restored.
POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
stack and then POP r4, that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
(ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.
So I think tailcalling is only worthwhile if the function does not save
lr and r3 is free to hold the target address. Also needing consideration
is what happens if callee-saved registers other than lr need to be saved,
but I haven't looked into this.
I am aware of that, and I fully agree with you that with respect to speed and code size the code will not improve and may even get worse.
Why I still believe tail-call optimization to be beneficial on many occasions is my personal experience on real-world projects. The last 15 years of working on "v6m-scale" microcontroller systems in different companies have taught me that the true issue is mostly not program memory and speed but RAM and stack usage. Given a typical RAM/flash ratio of 8k/64k, I'd say that 1 byte of RAM spared is roughly "worth" 8 bytes of flash memory. (I know and admit that whether the tail calls would actually save RAM strongly depends on whether the biggest stack frame shows up due to tail calls or inner calls.)
If I were asked, I'd strongly advocate an "optimize for RAM" policy for v6m, even if it means making some limited compromises on code size and speed (see also my other mail from yesterday). I don't think there is a right or wrong answer to this question. It is risky to base the design choice on a single person's bias, so I'd suggest asking others for their opinion. Maybe there is also some useful benchmarking code around.
If I knew how to implement it, I'd suggest some heuristic at DAG generation that enables tail-call optimization if the expected stack frame size exceeds, say, 32 bytes, or if a specific compile switch is given.
A few comments on the patch itself (I've only given it a quick look over):
+ bool IsLRPartOfCalleeSavedInfo = false;
+
+ for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
if (CSI.getReg() == ARM::LR)
- IsV4PopReturn = true;
+ IsLRPartOfCalleeSavedInfo = true;
+ }
+
+ bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();
+ if (IsTailCallReturn) {
+ MBBI = MBB.getLastNonDebugInstr();
+
+ // First restore callee saved registers. Unlike for normal returns
+ // this is *not* done in restoreCalleeSavedRegisters.
+ const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
+
+ bool IsR4IncludedInCSI = false;
+ bool IsLRIncludedInCSI = false;
+ for (unsigned i = CSI.size(); i != 0; --i) {
+ unsigned Reg = CSI[i-1].getReg();
+ if (Reg == ARM::R4)
+ IsR4IncludedInCSI = true;
+ if (Reg == ARM::LR)
+ IsLRIncludedInCSI = true;
+ }
You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work
in setting IsLRIncludedInCSI.
Thank you for pointing this out. You are right; that stemmed from copy and paste. I previously had this part not in the epilogue generation but in the register restoring function, where I did need the duplication.
+ // ARMv4T and tail call returns require BX, see emitEpilogue
+ if ((STI.hasV4TOps() && !STI.hasV5TOps()))
The comment says 'tail call returns require BX', but the code doesn't do that.
Yes. In fact, I did use the tail-jump instruction, which is ultimately implemented as BX, but I agree, it would be better re-phrased as "require branch address in register".
As an additional question: when doing some execution testing in the simulator, I stumbled over an issue with predicates in the instructions.
+ AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH))
+ .addReg(ARM::R4,RegState::Kill));
It seems that the predicate operands need to be placed at a very specific position within the instruction operand list in order to prevent internal compiler errors. For push/pop, the predicate apparently has to be the first operand; for other instructions, it apparently has to come directly after the last explicit operand. Is there any documentation on where to place the predicates?
Yours,
Björn.