Hi,

The other day on IRC myself, Chandler, Hal and Pablo discussed recursive inlining. I’ve pondered a bit since then, and thought I’d get some background on the list with the aim of stimulating discussion.

Our goal is to inline recursive functions. We have several workloads with recursive calls, and inlining them to a certain degree can improve performance by up to 30%.

Inlining recursive functions is a slightly different problem to inlining non-recursive functions in that the inlining can never terminate, so you need to choose a factor (number of times) to inline. In this respect it is similar to loop unrolling, and Chandler suggested we should model it as such.

We originally thought he meant by this that the mechanics of inlining would stay the same, but the heuristics would have to be changed. This is the approach taken by Rugina & Rinard 1, which is almost the sum of the literature on this topic I could find, which surprised me. However, the actual suggestion was to remove the recursion entirely, and replace it with a loop.

Before I get into the mechanics of this, I’d like to explain what the possible optimization opportunities are for recursive functions turned into a loop.

- Unrolling

- This leads to bigger basic blocks (if they can be merged) and exposes more pipeline-level parallelism.
- CSE between iterations.
- Reduces backbranch overhead (generally negligible impact).

- Commoning / LICM

- Recursion doesn’t have a vehicle for moving common code out into a dominating block (compare with LICM for loops).

OK, first example:

void ex1(i) {

if (i == 0) return;

a[i] = i+1;

ex1(i-1);

}

gets converted to:

while (1) {

if (i == 0) break;

a[i] = i+1;

i = i-1;

}

Now, the above is the simplest case and is pure tail recursion. We can eliminate this currently with the TailRecursionElimination pass. The more complex case is where we have non-tail recusion and live variables.

void ex2(i) {

if (i == 0) return;

j = a[i];

ex2(i-1)

b[j] = i;

}

This requires a stack being modelled and could be converted to:

cnt = 0;

while (1) {

if (i == 0) break;

j = a[i];

stack[cnt++] = j;

stack[cnt++] = i;

++cnt;

–i;

}

while (cnt) {

i = stack[–cnt];

j = stack[–cnt];

b[j] = i;

}

The above is as far as the discussion got on IRC. The following are my half-baked thoughts.

The above transform works because the access pattern to the stack is strictly linear. Pushes are followed by pops. But I don’t think this is the most interesting case for recursion in real programs. I see (qualitatively, poorly) 3 reasons for recursion in code that should go fast:

- A fibonnacci function in a microbenchmark or compiler shootout. I think we can safely ignore this usecase…
- Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
- Traversing a data structure, where the recursion is simply to get to the leaves which is where all the fun stuff happens.

Points 2 and 3 both share a trait that they’re dividing/fanning out at each step:

void ex3(i) {

if (is_base(i)) {

f();

return;

}

g(i);

ex3(i / 2);

ex3(i / 2 + 1);

}

The access pattern of such a function is not linear. It is a pre-order walk of a (binary, in this case) tree. As such we can’t use the two-loop transform above, but would have to form one loop and just manually implement the stack operations:

cnt = 0

stack[cnt++] = i;

while (1) {

i = stack[–cnt];

if (is_base(i)) {

f(); continue;

}

g(i);

stack[cnt++] = i / 2 + 1;

stack[cnt++] = i / 2;

}

OK, it’s still a well-formed loop, we can still do useful operations on it like LICM or, if it’s profitable, unrolling.

Now what about a more complex case, like in the FFT butterfly algorithm:

void ex4(i) {

if (is_base(i)) {

f(i); return;

}

g(i);

ex4(i / 2);

ex4(i / 2 + 1);

h(i);

}

Here, we have to do work after the last recursive call. This presents a problem, because we don’t have any of the caller’s context when dealing with a node. Is a node N the first recursive call? or the second? Call-stack recursion can save two things: the state of variables and the return address which serves as another kind of state. The obvious answer is to turn this into a sort of state machine:

cnt = 0

stack[cnt++] = i;

stack[cnt++] = STATE_1;

while (1) {

state = stack[–cnt];

i = stack[–cnt];

if (state == STATE_FINISH) {

h(i);

continue;

}

if (is_base(i)) {

f(); continue;

}

g(i);

stack[cnt++] = i; // Push ourself again, in the final state this time

stack[cnt++] = STATE_FINISH;

stack[cnt++] = i / 2 + 1;

stack[cnt++] = STATE_1;

stack[cnt++] = i / 2;

stack[cnt++] = STATE_1;

}

This solution can be generalised for if there is code in between the recursive calls (just add more states). It’s getting more complex in terms of control flow though - the question is, is it worth it?

Unrolling this loop is probably not going to usefully create larger basic blocks due to how complex the control flow is. I suppose we have the opportunity, if we unroll, of CSE and jump threading values through equivalent blocks that may gain us time. However, we can still do LICM which is probably a major improvement, and there are no calls/returns to speculate.

This approach would have to be extended if the recursive calls didn’t return void (for example it was a recursive reduction), but it’s not impossible.

Sorry for the long post. What do people think?

Cheers,

James