LiveVariables/LiveInterval on huge functions

Hi,

In PR2193 LiveVariables runs out of memory on a 512M limit, after
processing 11557 basicblocks.
VirtRegInfo has ~180000 entries with ~700 bytes each.
If I give it more memory (1.5G) it runs out of memory in LiveInterval.

I don't see any easy solution to reduce memory usage, but should we
optimize such a huge function at once?
If the function is over some reasonable limit (5k basic-blocks?) we
could split it into smaller functions, and avoid
these problems.

I am writing a pass to do this split at llvm IR level. What do you think?

--Edwin

Hi,

In PR2193 LiveVariables runs out of memory on a 512M limit, after
processing 11557 basicblocks.
VirtRegInfo has ~180000 entries with ~700 bytes each.
If I give it more memory (1.5G) it runs out of memory in LiveInterval.

Some of the information kept by LiveVariables are somewhat redundant and can be removed. I think there was talk of folding much of its computation into LiveIntervalAnalysis. It's not clear whether that would make a difference without careful analysis.

It's not clear to me what to do about LiveInterval. It does keep a lot of information but it's all essential.

I suspect StrongPHIElimination may help the situation somewhat but again I can't guess its impact.

I don't see any easy solution to reduce memory usage, but should we
optimize such a huge function at once?
If the function is over some reasonable limit (5k basic-blocks?) we
could split it into smaller functions, and avoid
these problems.

But what is the right limit? How do you estimate the limit given different host configuration? If you can make the decision earlier, than it can default to the local register allocator path which doesn't require LiveIntervalAnalysis.

I am writing a pass to do this split at llvm IR level. What do you think?

This functionality sounds useful, but I am not sure it's the *right* fix. We should definitely try harder to make codegen more scalable. But it's not a trivial problem.

Evan

Evan Cheng wrote:

Hi,

In PR2193 LiveVariables runs out of memory on a 512M limit, after
processing 11557 basicblocks.
VirtRegInfo has ~180000 entries with ~700 bytes each.
If I give it more memory (1.5G) it runs out of memory in LiveInterval.
    
Some of the information kept by LiveVariables are somewhat redundant
and can be removed. I think there was talk of folding much of its
computation into LiveIntervalAnalysis. It's not clear whether that
would make a difference without careful analysis.

It's not clear to me what to do about LiveInterval. It does keep a lot
of information but it's all essential.

I suspect StrongPHIElimination may help the situation somewhat but
again I can't guess its impact.

Another question to ask, is why that function became so large in the
first place [X86DAGToDAGISel::SelectCode(llvm::SDOperand)]
We have inline limits, don't we?

I don't see any easy solution to reduce memory usage, but should we
optimize such a huge function at once?
If the function is over some reasonable limit (5k basic-blocks?) we
could split it into smaller functions, and avoid
these problems.
    
But what is the right limit? How do you estimate the limit given
different host configuration?

I think having a limit is better than having none. It will prevent
memory usage to get out of control.
The memory usage of LiveVariables/LiveInterval looks like O(N^2) to me,
if we limit N, the memory usage can differ at most by a constant factor.
Having a function twice the size, won't suddenly require 4x the memory.
Determining a right limit, to fit exactly into a given amount of memory
is hard.
We could gather some statistics on medium and maximum basic-block count
in applications from llvm-test. Then choose a default value, and allow
the user to override via a command-line option.

For example once gcc once gave me a warning, something like: more than
15000 BBs, disabling gcse (IIRC, I can't find the source file that gave
me that warning anymore).
Disabling an optimization entirely doesn't seem the best choice to me, I
think it is better if we split it into smaller functions, and optimize
those.

But maybe the right solution is to avoid generating functions that
large, instead of splitting it afterwards.

If you can make the decision earlier,
than it can default to the local register allocator path which doesn't
require LiveIntervalAnalysis.
  
Yep, the local allocator works.
How much worse is the performance of generated code with the local
allocator vs. the default?

I am writing a pass to do this split at llvm IR level. What do you
think?
    
This functionality sounds useful, but I am not sure it's the *right*
fix. We should definitely try harder to make codegen more scalable.
But it's not a trivial problem.

Good point. If we have the size-limiting pass, we could become lazy and
not care of scalability too much. So maybe it is better if I don't write
it :wink:

Best regards,
--Edwin

Török Edwin wrote:

Evan Cheng wrote:
  

Hi,

In PR2193 LiveVariables runs out of memory on a 512M limit, after
processing 11557 basicblocks.
VirtRegInfo has ~180000 entries with ~700 bytes each.
If I give it more memory (1.5G) it runs out of memory in LiveInterval.
    

Some of the information kept by LiveVariables are somewhat redundant
and can be removed. I think there was talk of folding much of its
computation into LiveIntervalAnalysis. It's not clear whether that
would make a difference without careful analysis.

It's not clear to me what to do about LiveInterval. It does keep a lot
of information but it's all essential.

I suspect StrongPHIElimination may help the situation somewhat but
again I can't guess its impact.

Another question to ask, is why that function became so large in the
first place [X86DAGToDAGISel::SelectCode(llvm::SDOperand)]
We have inline limits, don't we?

most of functions called by SelectCode get a -30000 cost reduction
because they are internal.
Even if Caller.size() is 40000, the penalty is only 2000, thus we still
have negative costs.

Removing the /20 factor from here improves the situation a lot, however
I think there is a reason for /20, and it can't be just removed:
InlineCost += Caller->size()/20;

Best regards,
--Edwin

Another question to ask, is why that function became so large in the
first place [X86DAGToDAGISel::SelectCode(llvm::SDOperand)]
We have inline limits, don't we?

Some functions are just big. Consider production versions of the WRF code
from spec 2006. It is machine-generated and can be HUGE!

>> I am writing a pass to do this split at llvm IR level. What do you
>> think?
>
> This functionality sounds useful, but I am not sure it's the *right*
> fix. We should definitely try harder to make codegen more scalable.
> But it's not a trivial problem.

Good point. If we have the size-limiting pass, we could become lazy and
not care of scalability too much. So maybe it is better if I don't write
it :wink:

Another option is regioned compilation. Generally, I think degrading
optimization (using a local allocator rather than a global one) is hackish.
Regioning does degrade optimization due to visibility limits, but it doesn't
require special-casing which flavors of optimization are run based on code
size.

I would very much like to see some kind of regioning make its way into llvm
at some point. I'd do it myself if I had free cycles, which I don't. :frowning:

                                                         -Dave

This sounds like unanticipated fallout from Evan's recent tweaks of the inliner. Evan, thoughts?

-Chris

Another question to ask, is why that function became so large in the
first place [X86DAGToDAGISel::SelectCode(llvm::SDOperand)]
We have inline limits, don't we?

most of functions called by SelectCode get a -30000 cost reduction
because they are internal.
Even if Caller.size() is 40000, the penalty is only 2000, thus we still
have negative costs.

Removing the /20 factor from here improves the situation a lot, however
I think there is a reason for /20, and it can't be just removed:
InlineCost += Caller->size()/20;

This sounds like unanticipated fallout from Evan's recent tweaks of the inliner. Evan, thoughts?

Previously the inliner assign each basic block cost of 20. So this line is simply estimating the number of caller basic blocks. My tweak simply removed the number of basic blocks from the equation so the cost of a callee is simply number of instructions * 5. I don't think it should / would impact this case. Edwin, can you revert 49061 and 48725 to see if they have any impact?

The -30000 cost reduction for internal function does seem excessive though.

Evan

Right, so now the cost estimate of the function is much lower than it was before. This isn't itself a problem, but it means that the 30000 bonus for being called at one callsite should also be reduced to match, seem reasonable?

-Chris

Sure, it makes sense. I'll take a look at it.

Evan

Evan Cheng wrote:

Another question to ask, is why that function became so large in the
first place [X86DAGToDAGISel::SelectCode(llvm::SDOperand)]
We have inline limits, don't we?
        

most of functions called by SelectCode get a -30000 cost reduction
because they are internal.
Even if Caller.size() is 40000, the penalty is only 2000, thus we
still
have negative costs.

Removing the /20 factor from here improves the situation a lot,
however
I think there is a reason for /20, and it can't be just removed:
InlineCost += Caller->size()/20;
      

This sounds like unanticipated fallout from Evan's recent tweaks of
the inliner. Evan, thoughts?
    
Previously the inliner assign each basic block cost of 20. So this
line is simply estimating the number of caller basic blocks. My tweak
simply removed the number of basic blocks from the equation so the
cost of a callee is simply number of instructions * 5. I don't think
it should / would impact this case. Edwin, can you revert 49061 and
48725 to see if they have any impact?
  
Reverting 49061 does have an impact. I can compile the testcase withing
1.1G of memory, and memory usage was rising slowly (except for a quick
500M->1G jump).
The patch couldn't be reverted cleanly, I had to manually revert parts
of it.

Best regards,
--Edwin

Evan Cheng wrote:

Another question to ask, is why that function became so large in the
first place [X86DAGToDAGISel::SelectCode(llvm::SDOperand)]
We have inline limits, don't we?

most of functions called by SelectCode get a -30000 cost reduction
because they are internal.
Even if Caller.size() is 40000, the penalty is only 2000, thus we
still
have negative costs.

Removing the /20 factor from here improves the situation a lot,
however
I think there is a reason for /20, and it can't be just removed:
InlineCost += Caller->size()/20;

This sounds like unanticipated fallout from Evan's recent tweaks of
the inliner. Evan, thoughts?

Previously the inliner assign each basic block cost of 20. So this
line is simply estimating the number of caller basic blocks. My tweak
simply removed the number of basic blocks from the equation so the
cost of a callee is simply number of instructions * 5. I don't think
it should / would impact this case. Edwin, can you revert 49061 and
48725 to see if they have any impact?

Reverting 49061 does have an impact. I can compile the testcase withing
1.1G of memory, and memory usage was rising slowly (except for a quick
500M->1G jump).

Thanks.

The patch couldn't be reverted cleanly, I had to manually revert parts
of it.

I don't actually want to revert the patch. I'll tweak the heuristics though.

Evan