How to optimize conditional recusive functions?

Here is the brief example: Compiler Explorer

Simply, GCC would optimize the following code successfully while clang wouldn’t:

int clang_failed(int x, int y) {
    if (x == 0)
        return y;
        return clang_failed(x - 1, y * x);

GCC would optimize it into a simple form:

clang_failed(int, int):
        movl    %esi, %eax
        testl   %edi, %edi
        je      .L5
        imull   %edi, %eax
        subl    $1, %edi
        jne     .L2

while clang would translate it honestly.

Originally I thought there might be some optimization passes that clang is missing. But I found this is more interesting after I made some simple testing. For example. both GCC and clang would optimize following code:

int success1(int x, int y) {
    if (x == 0)
        return y;
        return success1(x - 1, y++);

But clang would optimize following codes while gcc refused to do:

int gcc_failed(int x, int y) {
    if (x == 0)
        return y;
        return gcc_failed(x - 1, y + x);

BTW, clang would optimize it as:

gcc_failed(int, int):                       # @gcc_failed(int, int)
        mov     eax, esi
        test    edi, edi
        je      .LBB2_2
        lea     ecx, [rdi - 1]
        lea     edx, [rdi - 2]
        imul    rdx, rcx
        imul    ecx, ecx
        shr     rdx
        add     eax, edi
        add     eax, ecx
        sub     eax, edx

which looks a little bit longer than the first version of GCC does. I thought the two codes are homogenized.

So the problem looks interesting. Does anyone knows what’s the pass to optimize it in clang? And if anyone know why clang failed to optimize the first one but succeed with the last one?


It’s not the recursion. That is folded away for this trivial cases.
LLVM doesn’t seem to have a closed form for this recursive/iterative function: Compiler Explorer


Oh, I guess I found the problem. It is related to the phase ordering. The optimization in clang is disabled by unrolling and vectorization. We could observe that clang could optimize it if we disabled the unrolling and vectorization: Compiler Explorer

Do you think it is an issue or a feature? I mean should we try to fix this one by adjusting the pipeline?

Add @aeubanks. Maybe he could help with the phase ordering problem.

I’m not sure what you think the problem is with clang’s output. I mean, the vectorizer is probably underestimating the cost of vectorizing a 32-bit multiply without SSE4.1. That doesn’t have anything to do with phase ordering, though…?

Do you mean that the default output is better than the output with -fno-unroll-loops -fno-slp-vectorize -fno-vectorize? Oh, I didn’t do benchmarking for it… I’ve just feel the output of the later is shorter…