Where's the optimiser gone? (part 5.b): missed tail calls, and more...

Compile the following functions with "-O3 -target i386"
(see <https://godbolt.org/z/VmKlXL>):

long long div(long long foo, long long bar)
{
    return foo / bar;
}

On the left the generated code; on the right the expected,
properly optimised code:

div: # @div
    push ebp                    |
    mov ebp, esp                |
    push dword ptr [ebp + 20]   |
    push dword ptr [ebp + 16]   |
    push dword ptr [ebp + 12]   |
    push dword ptr [ebp + 8]    |
    call __divdi3               |   jmp __divdi3
    add esp, 16                 |
    pop ebp                     |
    ret                         |

long long mod(long long foo, long long bar)
{
    return foo % bar;
}

mod: # @mod
    push ebp                    |
    mov ebp, esp                |
    push dword ptr [ebp + 20]   |
    push dword ptr [ebp + 16]   |
    push dword ptr [ebp + 12]   |
    push dword ptr [ebp + 8]    |
    call __moddi3               |   jmp __moddi3
    add esp, 16                 |
    pop ebp                     |
    ret                         |

long long mul(long long foo, long long bar)
{
    return foo * bar;
}

mul: # @mul
    push ebp
    mov ebp, esp
    push esi
    mov ecx, dword ptr [ebp + 16]
    mov esi, dword ptr [ebp + 8]
    mov eax, ecx
    imul ecx, dword ptr [ebp + 12]
    mul esi
    imul esi, dword ptr [ebp + 20]
    add edx, ecx
    add edx, esi
    pop esi
    pop ebp
    ret
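
For readers following the multiply listing, here is a rough C model of what that instruction sequence appears to compute: one widening 32x32->64 multiply of the low halves, plus two truncating 32x32->32 multiplies for the cross terms, which are added into the high half. The name mul_model and the local variable names are made up for illustration; this is a sketch of the arithmetic, not compiler output.

```c
#include <stdint.h>

/* Hypothetical model of the "mul" listing: schoolbook 64-bit multiply
   built from 32-bit pieces. Unsigned arithmetic is used throughout so
   that wrap-around is well defined. */
int64_t mul_model(int64_t foo, int64_t bar)
{
    uint32_t foo_lo = (uint32_t)foo;                    /* [ebp + 8]  */
    uint32_t foo_hi = (uint32_t)((uint64_t)foo >> 32);  /* [ebp + 12] */
    uint32_t bar_lo = (uint32_t)bar;                    /* [ebp + 16] */
    uint32_t bar_hi = (uint32_t)((uint64_t)bar >> 32);  /* [ebp + 20] */

    /* mul esi: full 64-bit product of the two low halves (edx:eax) */
    uint64_t low = (uint64_t)bar_lo * foo_lo;

    /* imul ecx, [ebp + 12] and imul esi, [ebp + 20]:
       cross terms, truncated to 32 bits */
    uint32_t cross = bar_lo * foo_hi + foo_lo * bar_hi;

    /* add edx, ecx / add edx, esi: cross terms only touch the high half */
    uint64_t result = low + ((uint64_t)cross << 32);

    /* Conversion back to signed relies on two's complement, as GCC and
       Clang provide. */
    return (int64_t)result;
}
```

The foo_hi * bar_hi term is absent because it would only contribute to bits 64 and above, which are discarded in a 64-bit result.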

Clang's -target option is supposed to take a CPU type and an operating system. So "-target i386" is giving it no operating system. This is preventing frame pointer elimination, which is why ebp is being updated. If you pass "-target i386-linux" you get slightly better code.

The division/remainder operations are turned into library calls as part of instruction selection. This code is somewhat independent of how other calls are handled. We probably don’t support tail calls in it. Is it really realistic that a user would have a non-inlined function that contains just a division? Why should we optimize for that case?

> Clang's -target option is supposed to take a CPU type and an operating
> system. So "-target i386" is giving it no operating system. This is
> preventing frame pointer elimination, which is why ebp is being updated. If
> you pass "-target i386-linux" you get slightly better code.

But the frame pointer is not the point here.

> The division/remainder operations are turned into library calls as part of
> instruction selection. This code is somewhat independent of how other calls
> are handled. We probably don't support tail calls in it. Is it really
> realistic that a user would have a non-inlined function that contains just
> a division? Why should we optimize for that case?

I've seen quite a few libraries that implement such functions as
target-independent wrappers which just call another function having
the same prototype.
So the question is not whether it's just a division, but more generally
about any call to a function having the same prototype.

regards
Stefan

You didn’t provide what you think the improved code would be for the multiply. So I wasn’t sure.

We do support tail calls to a function with the same prototype when there is a call in the original source code. The division/remainder case is special because we're turning an arithmetic operation into a call. This, for example, works:

long long bar(long long x, long long y);

long long foo(long long x, long long y) {
    return bar(x, y);
}
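
For anyone who wants to try this, a self-contained variant might look like the sketch below. The body of bar is a made-up placeholder, and the noinline attribute (a GCC/Clang extension) is only there so the call is not inlined away. Compiled with optimisation for i386, foo would be expected to end in "jmp bar" rather than "call bar" followed by "ret", though that has not been checked against any particular clang version here.

```c
/* Hypothetical callee: any function with this prototype would do.
   noinline keeps the call visible in the generated code. */
__attribute__((noinline))
long long bar(long long x, long long y)
{
    return x - y;
}

/* Same prototype as bar, and the call is the last thing foo does,
   so the incoming stack arguments can be reused as-is and the
   compiler is free to emit a jump instead of call + ret. */
long long foo(long long x, long long y)
{
    return bar(x, y);
}
```

The contrast with div and mod is that here the call exists in the source, whereas for foo / bar the call to __divdi3 only appears during instruction selection.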