[GSoC 2016] [Weekly Status] Interprocedural Register Allocation

Dear community,

This is a brief update on the progress of Interprocedural Register Allocation (IPRA). For those interested in following the progress in code, please see http://reviews.llvm.org/D20769
This patch contains simple infrastructure to propagate register usage information from callees to callers in the call graph. The code generation order is changed to follow a bottom-up order on the call graph; thanks to Mehdi Amini for the patch for that! I will write a blog post on this very soon.
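As a toy illustration of the scheme (illustrative only; the names and the plain bitmask here stand in for LLVM's MachineFunction and RegMask machinery), processing the call graph bottom-up means that by the time a caller is compiled, the full set of registers each of its callees may clobber is already known:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model: a "RegMask" is a bitmask of clobbered physical registers.
struct Func {
  std::string Name;
  uint32_t OwnClobbers;        // registers this function's body clobbers
  std::vector<Func *> Callees;
};

// Compute the full clobber set of F: its own clobbers plus those of its
// callees. Because code generation follows a bottom-up order on the call
// graph, every callee's entry is already present in Info.
uint32_t computeClobbers(Func &F, std::map<std::string, uint32_t> &Info) {
  uint32_t Mask = F.OwnClobbers;
  for (Func *Callee : F.Callees)
    Mask |= Info.at(Callee->Name);
  Info[F.Name] = Mask;
  return Mask;
}
```

A caller can then preserve only the registers its callees actually clobber, instead of assuming the worst case allowed by the calling convention.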

According to the proposed schedule, this week should have been spent studying the related infrastructure in LLVM and finalizing an approach for IPRA. Instead, I was able to implement a working (though perhaps not fully correct) prototype, because I used the community bonding period to discuss and learn the related material from my mentors, and because the patch for the CodeGen reordering was provided by my mentor Mehdi Amini.

Here is a summary of the work done during this week:

Implementation:

Dear Community,

This week my patch was reviewed by my mentors and I made changes based on their feedback. We also identified and fixed a bug where callee-saved registers were being marked as clobbered registers. I have described other minor changes in the following section.

The patch was expected to be committed by the end of this week, but due to an unexpected mistake I was not able to finish writing test cases. Sorry for that.
I had built LLVM with IPRA enabled by default, and those build files were on my PATH! Because of that, the next time I tried to build LLVM it was terribly slow (almost 1 hour for 10% of the build). I spent too much time trying to fix this by playing around with environment variables, CMake options, etc.
I think this is a serious concern: we need to verify this compile-time cost, otherwise building a large piece of software with IPRA enabled would be very time consuming.

The toughest part of this week was getting lit and FileCheck to work as expected, especially when an analysis pass prints its information to the terminal and there is also an output file generated by the llc or opt command.
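For anyone fighting the same problem: the trick that eventually worked for me was to discard the normal output file and route the analysis output into FileCheck. A sketch of such a lit RUN line (the exact flag and pass names follow my patch and may change):

```llvm
; RUN: llc -enable-ipra -print-regusage -o /dev/null 2>&1 < %s | FileCheck %s
; The analysis prints to the terminal, so 2>&1 merges it into the pipe
; for FileCheck, while -o /dev/null discards the generated code.
```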

Here is a brief summary:

Implementation:

Dear Community,

The patch for Interprocedural Register Allocation has now been committed; thanks to Mehdi Amini for that. We would like you to play with it and let us know your views and, more importantly, your ideas for improving it.

The test-suite run has indicated some non-trivial issues that result in runtime failures in the programs; we will be investigating this further. Here are some stats:

The test-suite was run with IPRA enabled, and the overall results are not very encouraging: on average, a 30% increase in compile time. Many programs also show an increase in execution time (20% on average), which is a really serious concern for us. About 60 tests failed at runtime, which indicates a compilation error. However, 3 tests improved in runtime, by 7% on average.

I think the good lesson for me this week was to set up the LLVM development environment properly; otherwise one can end up wasting too much time building LLVM itself.

Here is a brief summary:
Implementation:


Hi Vivek,

How much of the runtime slowdown comes from the different layout of
the functions in the asm file? (I.e., because of the dummy SCC pass.)

Hello Quentin,

Please do not consider the previous results: there was a major bug in the RegMask calculation, due to not considering the RegMasks of callees in the MF body while calculating register usage information. That has now been fixed (as discussed with Matthias Braun and Mehdi Amini), and after this bugfix I have run the test-suite with and without IPRA. Yes, there is a runtime slowdown for some test cases, ranging from 1% to 64%; similarly, the compile-time slowdown ranges from 1% to 48%. The runtime improvements range from 1% to 35%, and surprisingly there are also compile-time improvements, ranging from 1% to 60%. I would request you to go through the complete results at
https://docs.google.com/document/d/1cavn-POrZdhw-rrdPXV8mSvyppvOWs2rxmLgaOnd6KE/edit?usp=sharing

In the above results the baseline is IPRA and the current run is without IPRA, so the data with a red background is actually an improvement and green is a regression. Also, there are no extra failures due to IPRA now, so I have removed failures from the results.

-Vivek

Dear Community,

Please find a summary of the work done during this week below:

Implementation:

Dear Professor,

Thanks for bringing this to my notice. I tried out a simple test case with an indirect function call:

int foo() {
  return 12;
}

int bar(int a) {
  return foo() + a;
}

int (*fp)() = 0;
int (*fp1)(int) = 0;

int main() {
  fp = foo;
  fp();
  fp1 = bar;
  fp1(15);
  return 0;
}

Currently IPRA skips the optimization for indirect calls, but I think this can be handled, and I will let you know if I update the implementation to cover it. IPRA currently uses a Function * to hold register usage information across passes, so my hunch is that if a Function * can be derived from the call instruction of an indirect call, IPRA should be able to handle indirect calls to procedures defined in the current module.

Sincerely,
Vivek

Dear Professor,

Thanks for bringing this to my notice. I tried out a simple test case with
an indirect function call:

int foo() {
  return 12;
}

int bar(int a) {
  return foo() + a;
}

int (*fp)() = 0;
int (*fp1)(int) = 0;

int main() {
  fp = foo;
  fp();
  fp1 = bar;
  fp1(15);
  return 0;
}

I have experimented with indirect calls, specifically those due to the use of
function pointers, as shown in the example above.

The following code in RegUsageInfoPropagate.cpp handles this kind of indirect
call:

for (MachineBasicBlock &MBB : MF) {
  for (MachineInstr &MI : MBB) {
    if (!MI.isCall())
      continue;
    DEBUG(dbgs()
          << "Call Instruction Before Register Usage Info Propagation :\n");
    DEBUG(dbgs() << MI << "\n");

    auto UpdateRegMask = [&](const Function *F) {
      const auto *RegMask = PRUI->getRegUsageInfo(F);
      if (!RegMask)
        return;
      setRegMask(MI, &(*RegMask)[0]);
      Changed = true;
    };

    MachineOperand &Operand = MI.getOperand(0);
    if (Operand.isGlobal())
      UpdateRegMask(cast<Function>(Operand.getGlobal()));
    else if (Operand.isSymbol())
      UpdateRegMask(M->getFunction(Operand.getSymbolName()));
    else if (Operand.isReg()) {
      // Changes start here: walk backwards from the call to find the
      // instruction that defines the register holding the call target.
      unsigned VReg = Operand.getReg();
      MachineBasicBlock::iterator CallInstIterator(&MI);
      MachineBasicBlock *ParentMBB = MI.getParent();
      while (CallInstIterator != ParentMBB->begin() &&
             !CallInstIterator->definesRegister(VReg))
        --CallInstIterator;
      DEBUG(dbgs() << "Candidate for indirect call \n");
      if (CallInstIterator != ParentMBB->begin()) {
        for (MachineOperand &MO : CallInstIterator->operands()) {
          if (MO.isGlobal()) {
            UpdateRegMask(cast<Function>(MO.getGlobal()));
            break;
          } else if (MO.isSymbol()) {
            UpdateRegMask(M->getFunction(MO.getSymbolName()));
            break;
          }
        }
        DEBUG(dbgs() << *CallInstIterator);
      }
    }

    DEBUG(dbgs()
          << "Call Instruction After Register Usage Info Propagation :\n");
    DEBUG(dbgs() << MI << "\n");
  }
}

I would like the mentors' review/suggestions on this. For virtual-function-like
cases we will have to think differently. Is this a valid approach to dealing
with indirect calls?
Please let me know your thoughts.

-Vivek

Hi Vivek,

> int foo() {
>   return 12;
> }
>
> int bar(int a) {
>   return foo() + a;
> }
>
> int (*fp)() = 0;
> int (*fp1)(int) = 0;
>
> int main() {
>   fp = foo;
>   fp();
>   fp1 = bar;
>   fp1(15);
>   return 0;
> }

IMO it is a waste of time trying to do a better job at the IPRA level on
IR like the above ^. LLVM should be folding the indirect calls to
direct calls at the IR level, and if it isn't, that's a bug in the
IR-level optimizer.

The interesting cases are when you have a call like:

   fnptr target = object->callback;
   target(foo, bar);

and the IR level optimizer has failed to optimize the indirect call to
`target` to a direct call.

-- Sanjoy

+1 from me.

The interesting cases are the non-obvious ones (assuming foo/bar have the same parameters). Things get interesting once you have uncertainty in the mix. The minimally interesting case would look like this:

int main() {
    int (*fp)();
    if (rand()) {
        fp = foo;
    } else {
        fp = bar;
    }
    fp(42);
}

However, predicting the possible targets of a call is IMO a question of computing a call graph data structure and improving upon that. We should be sure to discuss and implement this independently of the register allocation work!

- Matthias


Yes, at -O3 simple indirect calls, including virtual function calls, get optimized to direct calls.


I tried that case, and my simple hack fails to optimize it :) . This
requires more discussion on IRC.
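One possible direction for such uncertain cases: if the set of candidate targets can be computed, the call site could conservatively preserve a register only if every candidate preserves it. In LLVM's regmask encoding a set bit means the register is preserved across the call, so merging candidate masks is a bitwise AND. A toy sketch of that merge (illustrative only, not code from my patch):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Merge the preserved-register masks of all possible targets of an
// indirect call: a register survives the call only if every candidate
// target preserves it (set bit = preserved, as in LLVM regmasks).
uint32_t mergeRegMasks(const std::vector<uint32_t> &TargetMasks) {
  uint32_t Merged = ~0u;         // start from "everything preserved"
  for (uint32_t Mask : TargetMasks)
    Merged &= Mask;              // drop anything some target clobbers
  return Merged;
}
```

With no information about the targets, this degenerates to the calling convention's default mask, so it can only improve on the status quo.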

Sincerely,
-Vivek

Hello LLVM Developers,

Please find below a summary of the work done during this week.

Implementation:

Hello LLVM Developers,

This week much of my time was consumed by debugging IPRA's effect on higher-level optimizations, specifically problems due to not having callee-saved registers. It was hard, but I learned a lot, and LLDB helped me a lot.

Here is a summary for this week:

Implementation:

Hello LLVM Developers,

Please feel free to send any ideas you can think of to improve the current IPRA. I will work on them and, if possible, implement them.

Please consider the summary of the work done during this week below.

Implementation:

Dear Community,

Sorry for the late weekly status report.

Please find this week's summary below:

Implementation:

Dear Community,

Sorry for the late weekly status report, but I was traveling back to my college.

Please consider this week’s summary as follows:

Implementation:

Dear Community,

I hope you have gone through the results for IPRA. Please feel free to share your thoughts on the spreadsheet.

Please consider this week's work as follows:

Implementation: