RFC: Recursive inlining

Hi,

Apologies for the very late response.

We have manually tried the idea with a very simple Fibonacci sequence code. While being very very simple, the recursion cannot be handled by TRE. Because there are two recursive callsites, it also needs to keep some sort of state across iterations of the "while(stack not empty)" loop.

We get between 2.5 and 8x slowdowns depending on which processor, and the branch mispredictions are huge (specially in x86: around 100 times higher). We are replacing the function call mechanism by a stack + switch statement to mimic the control flow for callsites. My understanding is that the branches in the switch are hardly predictable: they depend on a value taken from the stack at each iteration. Things could get harder as we increase the number of recursive callsites in a function (we increase the size of the switch).

Also, virtually all processors beyond a certain level of sophistication (let's say, meant to run an OS) feature a return-stack prediction mechanism. This results in highly-efficient branch prediction for calls+returns. Since we are transforming the "call jumps" into "switch jumps", we are probably misusing that piece of hardware.

Of course, this is not to say that this is a dead end. I paste the code here (original and "unrolled" version) so that you see the problem, and maybe you can spot something that I missed. Put your favorite queue implementation in there.

Cheers,
Pablo

unsigned fibonacci(unsigned n){

  if (n == 0)
    return 0;
  else if (n == 1)
    return 1;
  else
    return fibonacci(n - 1) + fibonacci(n - 2);
}

unsigned fibonacci_opt(unsigned n){

  // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call
  enum {S0, S1, S2};

  Stack s;
  int ret;
  init(&s, 256);
  push(&s, n);
  push(&s, S0);

  while (!empty(s)){

    /* print(&s); */

    int state = pop(&s);
    n = pop(&s);

    int a, b;

    switch(state){
    case S0:

      if (n <= 0)
        ret = 0;

      else if (n == 1)
        ret = 1;

      else{
        // On "fake return" we need to move to the end of the first "recursive call".
        push(&s, n);
        push(&s, S1);

        // Go for the first "recursive call"
        push(&s, n - 1);
        push(&s, S0);
      }
      break;

    case S1:
      // On "fake return", we want to recover the return of the previous call.
      push(&s, ret);
      push(&s, S2);

      // Go for the second "recursive call"
      push(&s, n - 2);
      push(&s, S0);
      break;

    case S2:
      ret += n;
    }
  }

  delete(&s);
  return ret;
}