Small problem with the tail call elimination pass

Hello everybody,

On the documentation page for the tailcallelim pass you can read:

"This pass transforms functions that are prevented from being tail recursive by an
associative expression to use an accumulator variable, thus compiling the typical
naive factorial or fib implementation into efficient code”

However, I don’t see this behavior when trying to compile this variant of
the mentioned “typical naive fib implementation”:

unsigned int fib(unsigned int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

The IR with clang -O3 (version 3.4) is this:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 2
  br i1 %1, label %8, label %2

; <label>:2 ; preds = %0
  %3 = add i32 %n, -1
  %4 = tail call i32 @_Z9fibj(i32 %3)
  %5 = add i32 %n, -2
  %6 = tail call i32 @_Z9fibj(i32 %5)
  %7 = add i32 %6, %4
  ret i32 %7

; <label>:8 ; preds = %0
  ret i32 %n
}

You see that the tail call is not eliminated. However, it does get eliminated
if I change to this code:

unsigned int fib(unsigned int n) {
    return n <= 2 ? 1 : fib(n-1) + fib(n-2);
}

IR:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 3
  br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse: ; preds = %0, %tailrecurse
  %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]
  %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]
  %2 = add i32 %n.tr2, -1
  %3 = tail call i32 @_Z9fibj(i32 %2)
  %4 = add i32 %n.tr2, -2
  %5 = add i32 %3, %accumulator.tr1
  %6 = icmp ult i32 %4, 3
  br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse._crit_edge: ; preds = %tailrecurse, %0
  %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]
  ret i32 %accumulator.tr.lcssa
}

Now, note that gcc 4.8 compiles well both the examples, and the
resulting loop gets also unrolled a dozen times.

I’m compiling with -O3 with both compilers.

So what is stopping the tail call elimination pass to doing the same
thing?

Note that I don’t write this email because I need a performing
fibonacci function… What I’m trying to do is to track down the
reason why fortran code is performing better than C on the
source code used as benchmark in the main page of the
Julia language in [1]. I wasn’t expecting this huge difference
between fortran and C code on an example as simple as this, so I
tried to investigate, even if I don’t know anything about fortran.
Their source code use int instead of unsigned as in my example.
I don’t know quite why, but this slows down everything both in gcc
and clang, because of something related to how the two different
languages handle overflow in integer arithmetic, I think. When I use
unsigned int, gcc performance matches gfortran’s, and the generated
code is nearly identical, while clang stays way behind because of the
missing tail call elimination.

Note that even with the modified example where the tail call does
get eliminated, a simple benchmark shows that clang’s code
is still 2-3x times slower than gcc's, I suppose because of the missing
unrolling.

Note that in that benchmark, Julia seems winning too, and they use
LLVM as the backend so something’s missing...

Can you help me understand what’s happening?

Thank you,
Nicola

[1] http://julialang.org

Hello everybody,

On the documentation page for the tailcallelim pass you can read:

"This pass transforms functions that are prevented from being tail
recursive by an
associative expression to use an accumulator variable, thus compiling the
typical
naive factorial or fib implementation into efficient code”

However, I don’t see this behavior when trying to compile this variant of
the mentioned “typical naive fib implementation”:

unsigned int fib(unsigned int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

The IR with clang -O3 (version 3.4) is this:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 2
  br i1 %1, label %8, label %2

; <label>:2 ; preds = %0
  %3 = add i32 %n, -1
  %4 = tail call i32 @_Z9fibj(i32 %3)
  %5 = add i32 %n, -2
  %6 = tail call i32 @_Z9fibj(i32 %5)
  %7 = add i32 %6, %4
  ret i32 %7

; <label>:8 ; preds = %0
  ret i32 %n
}

You see that the tail call is not eliminated. However, it does get
eliminated
if I change to this code:

unsigned int fib(unsigned int n) {
    return n <= 2 ? 1 : fib(n-1) + fib(n-2);
}

IR:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 3
  br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse: ; preds = %0,
%tailrecurse
  %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]
  %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]
  %2 = add i32 %n.tr2, -1
  %3 = tail call i32 @_Z9fibj(i32 %2)
  %4 = add i32 %n.tr2, -2
  %5 = add i32 %3, %accumulator.tr1
  %6 = icmp ult i32 %4, 3
  br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse._crit_edge: ; preds = %tailrecurse,
%0
  %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]
  ret i32 %accumulator.tr.lcssa
}

Now, note that gcc 4.8 compiles well both the examples, and the
resulting loop gets also unrolled a dozen times.

I’m compiling with -O3 with both compilers.

So what is stopping the tail call elimination pass to doing the same
thing?

At the moment TRE runs, the function reads:

; Function Attrs: nounwind readnone
define i32 @fib(i32 %n) #0 {
entry:
  %cmp = icmp ult i32 %n, 2
  br i1 %cmp, label %cond.end, label %cond.false

cond.false: ; preds = %entry
  %sub = add i32 %n, -1
  %call = tail call i32 @fib(i32 %sub)
  %sub1 = add i32 %n, -2
  %call2 = tail call i32 @fib(i32 %sub1)
  %add = add i32 %call, %call2
  br label %cond.end

cond.end: ; preds = %entry,
%cond.false
  %cond = phi i32 [ %add, %cond.false ], [ %n, %entry ]
  ret i32 %cond
}

TRE starts at the 'ret' and looks upwards and checks all of that block's
predecessors to find any that end in unconditional branches. Then it walks
upwards in such blocks to find if there's a "TRE candidate" which is a call
to itself (but aborts if there are instructions is can't handle between the
call and the end of the block), and manages to find %call2 -- and calls
EliminateRecursiveTailCall on it. Good job! At this stage, it transforms
the "br label %cond.end" into a "ret i32 %add" and updates the phi in
%cond.end. There are now *two* returns, the one in %cond.end is "ret i32
%n".

EliminateRecursiveTailCall detects that TRE'ing this requires performing
accumulator recursion elimination. To do this, it checks that "the function
containing the specified tail call consistently returns the same
runtime-constant value" (at all exit points except for our ret at the end
of %cond.false). Now that we have a ret i32 %n, this is false; the safety
check finds that the argument for %n can change because we pass in %sub and
%sub1 into ourselves instead of always passing %n to ourselves. That's what
stops the transformation.

Your working example makes two modifications, you change "n < 2" to "n <=
2" and you change "? n" to "? 1". It's the second of these that will make
the difference, because it changes %cond to:

  %cond = phi i32 [ %add, %cond.false ], [ 1, %entry ]

Everything proceeds the same way as above, except that when the branch in
%cond.false is replaced with a return, the return in %cond.end folds to
"ret i32 1", a constant. Being the same value each time, this is simple
enough for accumulator recursion elimination.

Nick

PS. In case you aren't aware, I recently bemoaned the state of our TRE
abilities here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140505/216092.html

Note that I don’t write this email because I need a performing

I don’t know who wrote that description of TRE, but your fib implementation is obviously not tail recursive, so I don’t know what we could do with it. Maybe the author had this version in mind:

int fib_help(int a, int b, int n) {
if (n > 0)
return fib_help(b, a+b, n-1);
return a;
}
int fib(int n) {
return fib_help(0, 1, n);
}

TRE can turn the helper into a loop here.

Thank you very much for the explanation! It all makes sense now.
However, the point remains that gcc (compiling down from both C or fortran) can indeed optimize this code despite the non-constant
base case. Do you or someone know what reasoning does it apply?
Should I file a bug report?

Very interesting!

Thanks,
Nicola

Nicola Gigante wrote:

At the moment TRE runs, the function reads:

; Function Attrs: nounwind readnone
define i32 @fib(i32 %n) #0 {
entry:
%cmp = icmp ult i32 %n, 2
br i1 %cmp, label %cond.end, label %cond.false

cond.false: ; preds = %entry
%sub = add i32 %n, -1
%call = tail call i32 @fib(i32 %sub)
%sub1 = add i32 %n, -2
%call2 = tail call i32 @fib(i32 %sub1)
%add = add i32 %call, %call2
br label %cond.end

cond.end: ; preds = %entry, %cond.false
%cond = phi i32 [ %add, %cond.false ], [ %n, %entry ]
ret i32 %cond
}

TRE starts at the 'ret' and looks upwards and checks all of that
block's predecessors to find any that end in unconditional branches.
Then it walks upwards in such blocks to find if there's a "TRE
candidate" which is a call to itself (but aborts if there are
instructions is can't handle between the call and the end of the
block), and manages to find %call2 -- and calls
EliminateRecursiveTailCall on it. Good job! At this stage, it
transforms the "br label %cond.end" into a "ret i32 %add" and updates
the phi in %cond.end. There are now *two* returns, the one in
%cond.end is "ret i32 %n".

EliminateRecursiveTailCall detects that TRE'ing this requires
performing accumulator recursion elimination. To do this, it checks
that "the function containing the specified tail call consistently
returns the same runtime-constant value" (at all exit points except
for our ret at the end of %cond.false). Now that we have a ret i32 %n,
this is false; the safety check finds that the argument for %n can
change because we pass in %sub and %sub1 into ourselves instead of
always passing %n to ourselves. That's what stops the transformation.

Your working example makes two modifications, you change "n < 2" to "n
<= 2" and you change "? n" to "? 1". It's the second of these that
will make the difference, because it changes %cond to:

%cond = phi i32 [ %add, %cond.false ], [ 1, %entry ]

Everything proceeds the same way as above, except that when the branch
in %cond.false is replaced with a return, the return in %cond.end
folds to "ret i32 1", a constant. Being the same value each time, this
is simple enough for accumulator recursion elimination.

Thank you very much for the explanation! It all makes sense now.
However, the point remains that gcc (compiling down from both C or
fortran) can indeed optimize this code despite the non-constant
base case. Do you or someone know what reasoning does it apply?

The short answer is that their accumulator recursion elimination is more capable.

GCC keeps two accumulators, one for multiplicative factors and one for additive factors. They handle TRE where they produce "add + mult * f()" where f() is the tail call of itself. I haven't put thought into it to figure out whether that's what we should do or if we can do it better.

Should I file a bug report?

Sure.

Nick