bx instruction getting generated in arm assembly for O1

Hi Jonathan,

The assembly generated by clang-3.5 is:

indirect_call:
.fnstart
.Leh_func_begin0:
ldr r0, .LCPI0_0
ldr r1, .LCPI0_1
.LPC0_0:
add r0, pc, r0
ldr r0, [r1, r0]
ldr r0, [r0]
bx r0
.align 2
.LCPI0_0:
.long _GLOBAL_OFFSET_TABLE_-(.LPC0_0+8)
.LCPI0_1:
.long indirect_func(GOT)
.Ltmp0:
.size indirect_call, .Ltmp0-indirect_call
.Leh_func_end0:
.fnend

With clang-3.4.2, the assembly generated is:

indirect_call:
push {r11, lr}
ldr r0, .LCPI0_0
mov r11, sp
ldr r1, .LCPI0_1
.LPC0_0:
add r0, pc, r0
ldr r0, [r1, r0]
ldr r0, [r0]
blx r0
pop {r11, pc}
.align 2
.LCPI0_0:
.long _GLOBAL_OFFSET_TABLE_-(.LPC0_0+8)
.LCPI0_1:
.long indirect_func(GOT)
.Ltmp0:
.size indirect_call, .Ltmp0-indirect_call

Both assemblies are generated with O1 optimization. The assembly generated with the trunk version of clang is similar to that of 3.5.

Thanks,

Mayur

------- Original Message -------

Sender : Jonathan Roelofs <jonathan@codesourcery.com>

Title : Re: [LLVMdev] bx instruction getting generated in arm assembly for O1

Hi,

For the following test:

int (*indirect_func)();

int indirect_call()
{
return indirect_func();
}

when generating the assembly with clang-3.5, for -march=armv5te, there is a
difference in the assemblies generated with O0 and O1:

In the assembly generated with O0, we are getting the “blx” instruction whereas
with O1 we get “bx” (in 3.4.2 we used to get “blx” for both O0 and O1).

Can you post the asm that you’re seeing for this function?

There’s a related case to this on armv4t which Iain has a patch for, that I
think we forgot about… The problem there is that armv4t doesn’t have blx at
all, so we should be generating a sequence like: ‘mov r0, …; bl _Ltmp; _Ltmp: bx r0’.

Is this because of this patch: [llvm] r214959 - ARM: do not generate BLX
instructions on Cortex-M CPUs

I doubt it. armv5te isn’t a Cortex-M processor.

Cheers,

Jon

Again, this looks correct - the only difference is that the first version is better optimised. In the second version, r11 is spilled because it is used as the frame pointer (mov r11, sp). The following:

blx r0
pop {r11, pc}

is restoring r11 and jumping to the saved link register (and adjusting the stack pointer: you've got to love AArch32 assembly, where a jump, a stack pointer adjustment, and a register reload are a single instruction). If r11 is not spilled, then we're left with:

push {lr}
...
blx r0
pop {pc}

And this is equivalent to simply:

bx r0

So, again, what is the bug that your test is testing for? Or are you just checking that clang 3.5 really is doing tail-call optimisation in trivial cases?
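
To make the tail-call point concrete, here is a minimal sketch in C. Only indirect_func and indirect_call come from the test in this thread; indirect_call_plus_one and forty_one are hypothetical names added purely for illustration, and I have written the prototype as (void) rather than (). What the compiler actually emits depends on the target and the optimisation level, so treat the comments as the expected behaviour rather than a guarantee.

```c
/* Function pointer from the original test case. */
int (*indirect_func)(void);

/* The call is in tail position: its result is returned unchanged and no
   work remains afterwards. An optimising compiler may therefore branch
   with "bx r0" and let the callee return straight to our caller. */
int indirect_call(void)
{
    return indirect_func();
}

/* Hypothetical variant: the call is NOT in tail position, because the
   addition happens after it returns. The compiler has to keep a return
   address (blx on armv5te) and come back here. */
int indirect_call_plus_one(void)
{
    return indirect_func() + 1;
}

/* Hypothetical stub target so the sketch can be exercised on any host. */
int forty_one(void)
{
    return 41;
}
```

Both functions return the same values whether or not the tail call is applied; only the call sequence differs, which is why the bx form is a valid replacement for the push/blx/pop form.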

David