A question about code generated by clang

I’m sorry if this is an obvious question, but I just tried to compile the following function with -O3:

#include <stdint.h>

void g(uint64_t);
void f(uint64_t x) {
    g(x);
    g(x);
}

and the assembly generated by clang is:

f(unsigned long):                                  # @f(unsigned long)
        pushq   %rbx
        movq    %rdi, %rbx
        callq   g(unsigned long)
        movq    %rbx, %rdi
        popq    %rbx
        jmp     g(unsigned long)                           # TAILCALL

For the first call, why doesn’t it generate “push rdi; call g; pop rdi” instead of the five instructions above? It seems to me that version is both shorter and more efficient…? I think I must be missing something obvious, but I couldn’t figure it out myself.

Thanks for your help!

The push and pop are part of the function prologue and epilogue, which is much lower-level code that gets inserted at the beginning and end of each function. Its main job is to save any callee-saved registers the function uses (here %rbx, which holds x across the first call), allocate stack space, and make sure the stack is properly 16-byte aligned before any call.

This insertion happens quite late in code generation, after register allocation and most optimizations, so there isn’t much opportunity to rework things.
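
To make the alignment point concrete, here is a minimal sketch of the arithmetic, assuming the x86-64 System V calling convention. The helper names and the entry address are made up for illustration and are not anything from LLVM. On entry to f the stack pointer is 8 bytes off a 16-byte boundary, because the caller's call instruction pushed the return address, so a single 8-byte push brings it back into alignment before the call to g:

```cpp
#include <cstdint>

// Hypothetical helper: the stack pointer after `pushes` 8-byte pushes,
// starting from its value at function entry (the stack grows downward).
constexpr std::uint64_t rsp_after_pushes(std::uint64_t rsp_at_entry, unsigned pushes) {
    return rsp_at_entry - 8ULL * pushes;
}

// True when `rsp` sits on a 16-byte boundary, as required just before a call.
constexpr bool is_16_byte_aligned(std::uint64_t rsp) {
    return rsp % 16 == 0;
}

// Assumed, arbitrary entry value: any address that is 8 mod 16 would do,
// since the return address was pushed onto a 16-byte-aligned stack.
constexpr std::uint64_t kEntryRsp = 0x7ffc00001008ULL;

// At entry the stack is misaligned; after one push (pushq %rbx) it is aligned.
static_assert(!is_16_byte_aligned(rsp_after_pushes(kEntryRsp, 0)), "entry is 8 mod 16");
static_assert(is_16_byte_aligned(rsp_after_pushes(kEntryRsp, 1)), "one push aligns the stack");
```

By this arithmetic alone, a single push of any register would satisfy the alignment requirement. So the sketch only shows why some 8-byte adjustment is needed before the call, not why %rbx in particular is chosen.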

Thanks for the reply! So what you are saying is that LLVM thinks this is a case that is not worth optimizing (at the cost of the extra engineering complexity introduced). Is this understanding correct?