High memory use and LVI/Correlated Value Propagation

Hi all,
with the current trunk I have two major cases where clang needs more
than 2GB memory for compiling programs with -O2. One is related to GVN
and MemoryDependenceAnalysis and has a pending patch. The other is
related to the Correlated Value Propagation and Lazy Value Information
cache. Attached is a heap profile for one of the relevant test cases.
Looking at the sources, I don't see any form of CFG limiters in LVI or
CVP, so it doesn't seem surprising that large memory use can be easily
triggered. This normally also corresponds to very slow compilation, so
addressing it would fix two issues at the same time. A test case can be
found in PR10584 ("Extremely slow compilation spending time in Value
Propagation").

The question for me is: where should such complexity limits be placed --
in the CVP pass or inside LVI? Is anyone specifically interested in
hunting down the problem?

Joerg

expr.pdf (12 KB)

I don’t think that arbitrarily limiting the complexity of the search is the right approach. There are numerous ways the LVI infrastructure could be made more memory efficient. Fixing the existing code to be memory efficient is the right approach. Only once there’s no more low-hanging fruit should we even consider clamping the search.

In general, introducing clamps should be an absolute last resort. We really don’t want an optimizer that is fragile with respect to code size and shape.

Philip

Memory efficiency is only half of the problem. E.g. groonga's expr.c
needs 4 minutes to build on my laptop, a 2.7GHz i7. That doesn't sound
reasonable for -O2. Unlike the GVN issue, the cases I have run into do
finish after a(n unreasonable) while, so at least it is not trivially
superlinear.

Joerg

Okay, so rather than artificially limit stuff, we should see if we can fix
the efficiency of the algorithms.

CVP is an O(N * lattice height) problem. It sounds like it is behaving worse
than that for you.

I assume you mean something like #BB * #variables? The instances I have
seen are all very large functions with many branches. Consider it from
this perspective: there is currently only one hammer for controlling the
amount of memory/CPU time CVP will use from clang -- -O0 vs -O2. I
believe that -O2 should provide a result in a reasonable amount of time
and with a reasonable amount of memory. The 2GB clamp (and 1h of CPU
time) for a 64-bit clang are similar to the limits for a native build on
a 32-bit architecture, so they are not arbitrary constraints.

Joerg

No.
:wink:
I mean the theoretical complexity of performing the optimization CVP
performs (regardless of how it is currently implemented).

What CVP does is doable in time linear in (number of variables * lattice
height). It has some memory cost to do this.

Assume for a second the lattice height is fixed (IE we do not have
completely symbolic ranges).

It is possible, in linear time, to build an SSA-like form for the ranges it
is computing.
It is then possible, in linear time, to propagate those ranges to all
affected variables.

This is O(N) + O(N).

Doing so may change the ranges (they only get *better*, however).
So you may need to do this until they fall all the way down the lattice.

This means you may repeat this lattice height number of times.

CVP as currently implemented tries to be very lazy. The cost of being very
lazy is that it has a worse theoretical complexity in the worst case, but it
is faster in cases where there is not a lot to do.

Hi Joerg,

> with the current trunk I have two major cases where clang needs more
> than 2GB memory for compiling programs with -O2. One is related to GVN
> and MemoryDependenceAnalysis and has a pending patch. The other is
> related to the Correlated Value Propagation and Lazy Value Information
> cache.

For the LVI part, how recent are your measurements? I wonder whether
Akira's r255320 helps.

r255320 => [LazyValueInfo] Stop inserting overdefined values into
ValueCache to reduce memory usage.
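
For anyone who has not looked at that change, here is a rough, illustrative-only sketch of the general idea behind it, using made-up types (ToyCache, LatticeVal) rather than the actual LVI data structures: an overdefined result carries no information beyond "we gave up", so remembering it in a cheap set is enough and no full cache entry is needed for it.

#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical toy types for illustration only.
struct LatticeVal {
  bool Overdefined;
  int64_t Lo, Hi;   // meaningful only when !Overdefined
};

struct ToyCache {
  // value name -> (block id -> cached range); useful results only.
  std::map<std::string, std::map<int, LatticeVal>> ValueCache;
  // (block id, value name) pairs already known to be overdefined.
  std::set<std::pair<int, std::string>> OverDefinedCache;

  void remember(const std::string &V, int BB, const LatticeVal &LV) {
    if (LV.Overdefined)
      OverDefinedCache.insert({BB, V});   // cheap marker, no range stored
    else
      ValueCache[V][BB] = LV;             // full entry only for useful facts
  }
};

int main() {
  ToyCache C;
  C.remember("x", 1, LatticeVal{false, 0, 42});
  C.remember("y", 1, LatticeVal{true, 0, 0});
  std::cout << "ranges cached: " << C.ValueCache.size()
            << ", overdefined markers: " << C.OverDefinedCache.size() << "\n";
}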

From: "Daniel Berlin via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Joerg Sonnenberger" <joerg@britannica.bec.de>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, January 14, 2016 10:38:10 AM
Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value
Propagation

> >

> > > > I don't think that arbitrary limiting the complexity of the
> > > > search is the

> > > > right approach. There are numerous ways the LVI
> > > > infrastructure
> > > > could be

> > > > made more memory efficient. Fixing the existing code to be
> > > > memory

> > > efficient

> > > > is the right approach. Only once there's no more low hanging
> > > > fruit

> > > should

> > > > we even consider clamping the search.

> > >

> > > Memory efficiency is only half of the problem. I.e. groonga's
> > > expr.c

> > > needs 4m to build on my laptop, a 2.7GHz i7. That doesn't sound

> > > reasonable for a -O2. Unlike the GVN issue, the cases I have
> > > run
> > > into do

> > > finish after a(n unreasonable) while, so at least it is not
> > > trivially

> > > superlinear.

> > >

> >

> >

> > Okay, so rather than artificially limit stuff, we should see if
> > we
> > can fix

> > the efficiency of the algorithms.

> >

> > CVP is an O(N*lattice height) pass problem. It sounds like it is
> > being more

> > than that for you.

> I assume you mean something like #BB * #variables?

No.
:wink:
I mean the theoretical complexity of performing the optimization CVP
performs (regardless of how it is currently implemented).

What CVP does is doable in linear time on number of variables *
lattice height. It has some memory cost to do this.

Assume for a second the lattice height is fixed (IE we do not have
completely symbolic ranges).

It is possible, in linear time, to build an SSA-like form for the
ranges it is computing.
It is then possible, in linear time, to propagate those ranges to all
affected variables.

This is O(N) + O(N).

Doing so may change the ranges (they only get *better*, however).
So you may need to do this until they fall all the way down the
lattice.

This means you may repeat this lattice height number of times.

What is the height of the lattice here? I understand that it goes from top -> defined -> overdefined, and that's three, but defined is not itself one state. Because it is tracking integer ranges, those ranges could undergo 2^#bits refinements in the worst case (thus making the lattice height quite large), no?

Thanks again,
Hal

Did it take an hour to compile the file attached to PR10584 (hugemem.ii)?

It took about 4 minutes and about 450MB of memory on my machine to compile the file with “clang++ -O2”.

I've rebuilt all my test cases since the weekend. Things have been getting
better: I am now down to Groonga, G'MIC, FET and Scribus as the remaining
triggers.

Joerg

> From: "Daniel Berlin via llvm-dev" <llvm-dev@lists.llvm.org>
> To: "Joerg Sonnenberger" <joerg@britannica.bec.de>, "llvm-dev"
> <llvm-dev@lists.llvm.org>
> Sent: Thursday, January 14, 2016 10:38:10 AM
> Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value
> Propagation

>
>
> > >
>
>
> > > > > I don't think that arbitrary limiting the complexity of the
> > > > > search is the
>
> > > > > right approach. There are numerous ways the LVI
> > > > > infrastructure
> > > > > could be
>
> > > > > made more memory efficient. Fixing the existing code to be
> > > > > memory
>
> > > > efficient
>
> > > > > is the right approach. Only once there's no more low hanging
> > > > > fruit
>
> > > > should
>
> > > > > we even consider clamping the search.
>
> > > >
>
> > > > Memory efficiency is only half of the problem. I.e. groonga's
> > > > expr.c
>
> > > > needs 4m to build on my laptop, a 2.7GHz i7. That doesn't sound
>
> > > > reasonable for a -O2. Unlike the GVN issue, the cases I have
> > > > run
> > > > into do
>
> > > > finish after a(n unreasonable) while, so at least it is not
> > > > trivially
>
> > > > superlinear.
>
> > > >
>
> > >
>
> > >
>
> > > Okay, so rather than artificially limit stuff, we should see if
> > > we
> > > can fix
>
> > > the efficiency of the algorithms.
>
> > >
>
> > > CVP is an O(N*lattice height) pass problem. It sounds like it is
> > > being more
>
> > > than that for you.
>

> > I assume you mean something like #BB * #variables?
>
> No.
> :wink:
> I mean the theoretical complexity of performing the optimization CVP
> performs (regardless of how it is currently implemented).

> What CVP does is doable in linear time on number of variables *
> lattice height. It has some memory cost to do this.

> Assume for a second the lattice height is fixed (IE we do not have
> completely symbolic ranges).

> It is possible, in linear time, to build an SSA-like form for the
> ranges it is computing.
> It is then possible, in linear time, to propagate those ranges to all
> affected variables.

> This is O(N) + O(N).

> Doing so may change the ranges (they only get *better*, however).
> So you may need to do this until they fall all the way down the
> lattice.

> This means you may repeat this lattice height number of times.

> What is the height of the lattice here?

The current LVI is a bit hard to analyze; I'm not sure it's really as
fixed-height as one would think.

> I understand that it goes from top -> defined -> overdefined, and that's
> three, but defined is not itself one state. Because it is tracking integer
> ranges, those ranges could undergo 2^#bits refinements in the worst case
> (thus making the lattice height quite large), no?

If it tracks ranges on a per-bit basis, the max refinements for a given
range is O(bits * 3), assuming things do not go the wrong way on the
lattice.

IE either they are underdefined, or they are 1, or they are 0, or they are
overdefined.

The lattice is

      underdefined
        /      \
       0        1
        \      /
      overdefined

The only way to end up with 2**n refinements is if things can go the other
direction on the lattice.
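
A minimal self-contained sketch of that per-bit lattice (the BitState name and the code are invented for illustration; this is not LLVM's KnownBits): the meet only ever moves a bit downward, so each bit passes through at most the three levels above, which is where the O(bits * 3) bound comes from.

#include <cassert>
#include <iostream>

// Per-bit lattice: underdefined on top, 0 and 1 in the middle,
// overdefined at the bottom. (Toy illustration only.)
enum class BitState { Underdefined, Zero, One, Overdefined };

// Meet of two facts about the same bit; the result is never higher in
// the lattice than either input, so a bit can only move downward.
BitState meet(BitState A, BitState B) {
  if (A == B)
    return A;
  if (A == BitState::Underdefined)
    return B;
  if (B == BitState::Underdefined)
    return A;
  return BitState::Overdefined;   // 0 meets 1, or anything meets bottom
}

int main() {
  BitState Bit = BitState::Underdefined;
  Bit = meet(Bit, BitState::One);     // first fact: the bit is 1
  assert(Bit == BitState::One);
  Bit = meet(Bit, BitState::Zero);    // conflicting fact: fall to bottom
  assert(Bit == BitState::Overdefined);
  std::cout << "transitions for this bit: 2 (the maximum)\n";
}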

Note: This is the standard fixpoint algorithm (and what LVI is emulating,
AFAIK). There are range constraint based solvers that are significantly
more complicated, but that's not what LVI does.

(see https://www.cs.berkeley.edu/~daw/papers/range-tacas04.pdf, and the
paper i cite below, which does a variant of that on llvm in pretty much
linear time)

Also, the worst case here is one where every variable is affected by a
single range.

(it's also possible to lazily construct the ssa form, and only update
changed variables, in the same time bound. It's probably complicated enough
to implement that it may not be worth it).

In practice, you can do even complicated ranges very fast.

http://homepages.dcc.ufmg.br/~fernando/publications/papers/CGO13_raphael.pdf

(the time is basically linear in practice, even for large programs; the code
is on GitHub)

Ignore the part where they use it in the end for something else that we
don't care about, etc.

They are also solving for a much more complex set of ranges than we are
talking about for CVP, and for all the variables, so they use a real range
constraint solver in the end instead of a fixpoint with a fixed height
lattice.

From: "Daniel Berlin" <dberlin@dberlin.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Joerg Sonnenberger" <joerg@britannica.bec.de>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, January 14, 2016 1:05:56 PM
Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value
Propagation

> > From: "Daniel Berlin via llvm-dev" < llvm-dev@lists.llvm.org >

> > To: "Joerg Sonnenberger" < joerg@britannica.bec.de >, "llvm-dev"

> > < llvm-dev@lists.llvm.org >

> > Sent: Thursday, January 14, 2016 10:38:10 AM

> > Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value

> > Propagation

> >

> > > > On Wed, Jan 13, 2016 at 4:23 PM, Joerg Sonnenberger via
> > > > llvm-dev

> > > > <

> >

> >

> > > >

> >

> >

> > > > > > I don't think that arbitrary limiting the complexity of
> > > > > > the

> > > > > > search is the

> >

> > > > > > right approach. There are numerous ways the LVI

> > > > > > infrastructure

> > > > > > could be

> >

> > > > > > made more memory efficient. Fixing the existing code to
> > > > > > be

> > > > > > memory

> >

> > > > > efficient

> >

> > > > > > is the right approach. Only once there's no more low
> > > > > > hanging

> > > > > > fruit

> >

> > > > > should

> >

> > > > > > we even consider clamping the search.

> >

> > > > >

> >

> > > > > Memory efficiency is only half of the problem. I.e.
> > > > > groonga's

> > > > > expr.c

> >

> > > > > needs 4m to build on my laptop, a 2.7GHz i7. That doesn't
> > > > > sound

> >

> > > > > reasonable for a -O2. Unlike the GVN issue, the cases I
> > > > > have

> > > > > run

> > > > > into do

> >

> > > > > finish after a(n unreasonable) while, so at least it is not

> > > > > trivially

> >

> > > > > superlinear.

> >

> > > > >

> >

> > > >

> >

> > > >

> >

> > > > Okay, so rather than artificially limit stuff, we should see
> > > > if

> > > > we

> > > > can fix

> >

> > > > the efficiency of the algorithms.

> >

> > > >

> >

> > > > CVP is an O(N*lattice height) pass problem. It sounds like it
> > > > is

> > > > being more

> >

> > > > than that for you.

> >

> > > I assume you mean something like #BB * #variables?

> >

> > No.

> > :wink:

> > I mean the theoretical complexity of performing the optimization
> > CVP

> > performs (regardless of how it is currently implemented).

> > What CVP does is doable in linear time on number of variables *

> > lattice height. It has some memory cost to do this.

> > Assume for a second the lattice height is fixed (IE we do not
> > have

> > completely symbolic ranges).

> > It is possible, in linear time, to build an SSA-like form for the

> > ranges it is computing.

> > It is then possible, in linear time, to propagate those ranges to
> > all

> > affected variables.

> > This is O(N) + O(N).

> > Doing so may change the ranges (they only get *better*, however).

> > So you may need to do this until they fall all the way down the

> > lattice.

> > This means you may repeat this lattice height number of times.

> What is the height of the lattice here?

The current LVI is a bit hard to analyze, i'm not sure it's really as
fixed height as one would think.

> I understand that it goes from top -> defined -> overdefined, and
> that's three, but defined is not itself one state. Because it is
> tracking integer ranges, those ranges could undergo 2^#bits
> refinements in the worst case (thus making the lattice height quite
> large), no?

If it tracks ranges on a per-bit basis, the max refinements for a
given range is is O(bits * 3), assuming things do not go the wrong
way on the lattice.

IE either they are underdefined, or they are 1, or they are 0, or
they are overdefined.

The lattice is

underdefined
/ \
0 1
\ /
overdefined

As a separate matter, we *should* also do this (i.e. have a fixed-point known-bits analysis), but we don't.

> The only way to end up with 2**n refinements is if things can go the
> other direction on the lattice.
>
> Note: This is the standard fixpoint algorithm (and what LVI is
> emulating, AFAIK). There are range constraint based solvers that are
> significantly more complicated, but that's not what LVI does.

Right. LVI handles only single constant-bounded ranges for each variable independently.

> (see https://www.cs.berkeley.edu/~daw/papers/range-tacas04.pdf , and
> the paper i cite below, which does a variant of that on llvm in
> pretty much linear time)

Interesting; thanks for the references! For both, the analysis collapses SCCs in the control flow and, while the SCC analysis is quadratic, SCCs are generally small so they don't observe a problem in practice.

-Hal

> I assume you mean something like #BB * #variables?

> No.
> :wink:
> I mean the theoretical complexity of performing the optimization CVP
> performs (regardless of how it is currently implemented).

Well, yes, talking about big-O notation is fine; the question is what
the N stands for :slight_smile:

> What CVP does is doable in time linear in (number of variables * lattice
> height). It has some memory cost to do this.

Right, that's what I mean. I think the lattice height can easily degenerate
to the number of BBs in a function. Looking at the size of the IR, it
effectively becomes O(n^2) in the worst case, where n is the size of the IR.
That would justify finding some cut-off point, even if it is quite large and
depends on the optimization level (-O2 vs -O3?).

Joerg
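
To make the kind of input concrete, here is a purely hypothetical fragment (the function and the constants are invented, not taken from groonga or any of the other packages mentioned): one value is tested in a long chain of blocks, so CVP ends up asking LVI about the same value in many different blocks and the per-block cache grows with the number of blocks.

// Hypothetical shape only; imagine this pattern repeated for thousands
// of blocks in a large, machine-generated function.
int chain(int x) {
  int sum = 0;
  if (x < 1000) sum += 1;
  if (x < 999)  sum += 2;
  if (x < 998)  sum += 3;
  /* ... many more comparisons against x ... */
  return sum;
}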

The lattice height will never be based on the number of BBs, because the BBs are irrelevant to the lattice value of a given variable (which is usually a range).

IE the value a variable takes on is not related to its BB, but only to the range of things representable by that value’s type, plus a few kinds of things we may want to represent symbolically (IE value A is > 5).

In simple lattices, each element of “the range of things representable by that value type” is a separate lattice element, so it’s a very wide lattice, but not a very tall one.

So, for example, constant propagation usually uses a lattice of height 3, even though the width of the middle level is “every constant value a given variable could take on”. This is because a single variable can only take on one of those values in SSA.
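
A tiny sketch of that “wide but not tall” point, with invented names (this is not SCCP's or CVP's actual lattice code): the middle level has one element per possible constant, yet any single SSA value can make at most two downward moves.

#include <cstdint>
#include <iostream>

// Classic constant-propagation lattice value: three levels, however many
// constants exist. (Toy illustration only.)
struct ConstLattice {
  enum Kind { Undefined, Constant, Overdefined };
  Kind K = Undefined;
  int64_t Val = 0;                 // meaningful only when K == Constant

  // Merge a newly discovered constant; at most two downward moves are
  // possible for any one value.
  void mergeConstant(int64_t C) {
    if (K == Undefined) {
      K = Constant;
      Val = C;
    } else if (K == Constant && Val != C) {
      K = Overdefined;
    }                              // Overdefined stays Overdefined
  }
};

int main() {
  ConstLattice X;
  X.mergeConstant(7);   // Undefined -> Constant(7)
  X.mergeConstant(7);   // stays Constant(7)
  X.mergeConstant(9);   // Constant(7) -> Overdefined
  std::cout << "final kind: " << X.K << " (2 = Overdefined)\n";
}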

(Deliberately replying up thread to skip detailed discussion of the lattice.)

We have had bugs in the past which caused us to move back up the lattice. The most likely cause of long running time(*) is an accidental reversal in the lattice. Note that we move up the lattice intentionally in some cases (specifically around assumes) when intersecting facts from different sources.

(*) Assuming we're talking about an algorithmic issue and not simply poor implementation quality. We're definitely more wasteful about memory than we need to be, for instance. Depending on the caching behavior of the algorithm, this could appear super-linear in running time.

If you have found a case that involves many steps for each variable in the lattice, one principled fix would be to bound the number of constant range refinements allowed. E.g. if you updated the constant range 5 times, drop to overdefined.

Philip
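
A minimal sketch of that refinement budget (the BoundedRange name and the bookkeeping are made up for illustration; the budget of 5 is just the example number from the message above): count how often a range is actually narrowed and clamp to overdefined once the budget is exhausted.

#include <algorithm>
#include <cstdint>
#include <iostream>

// Toy illustration of bounding the number of range refinements.
struct BoundedRange {
  static constexpr unsigned MaxRefinements = 5;   // example budget
  int64_t Lo = INT64_MIN, Hi = INT64_MAX;
  unsigned Refinements = 0;
  bool Overdefined = false;

  void refine(int64_t NewLo, int64_t NewHi) {
    if (Overdefined)
      return;
    if (NewLo <= Lo && NewHi >= Hi)
      return;                                     // not actually narrower
    if (++Refinements > MaxRefinements) {         // budget exhausted: give up
      Overdefined = true;
      return;
    }
    Lo = std::max(Lo, NewLo);
    Hi = std::min(Hi, NewHi);
  }
};

int main() {
  BoundedRange R;
  for (int I = 0; I < 10; ++I)
    R.refine(-100 + I, 100 - I);                  // keeps narrowing
  std::cout << (R.Overdefined ? "dropped to overdefined\n"
                              : "still a precise range\n");
}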

I'll add the problematic files I have to PR10584.

Joerg

From: "Philip Reames via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Daniel Berlin" <dberlin@dberlin.org>, "Joerg Sonnenberger" <joerg@britannica.bec.de>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, January 14, 2016 3:26:02 PM
Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value Propagation

> I don't think that arbitrary limiting the complexity of the search
> is the
> right approach. There are numerous ways the LVI infrastructure
> could be
> made more memory efficient. Fixing the existing code to be memory
> efficient
> is the right approach. Only once there's no more low hanging fruit
> should
> we even consider clamping the search.

Memory efficiency is only half of the problem. I.e. groonga's expr.c
needs 4m to build on my laptop, a 2.7GHz i7. That doesn't sound
reasonable for a -O2. Unlike the GVN issue, the cases I have run into
do
finish after a(n unreasonable) while, so at least it is not trivially
superlinear.

Okay, so rather than artificially limit stuff, we should see if we
can fix the efficiency of the algorithms.

CVP is an O(N*lattice height) pass problem. It sounds like it is
being more than that for you. (Deliberately replying up thread to
skip detailed discussion of the lattice.)

We have had bugs in the past which causes us to move back up the
lattice. The most likely cause of long running time(*) is an
accidental reversal in the lattice. Note that we move up the lattice
intentionally in some cases (specifically around assumes) when
intersecting facts from different sources.

(*) Assuming we're talking about an algorithmic issue and not simply
poor implementation quality. We're definitely more wasteful about
memory than we need to be for instance. Depending on the caching
behavior of the algorithm, this could appear super linear in running
time.

If you have found a case that involves many steps for each variable
in the lattice, one principled fix would be to bound the number of
constant range refinements allowed. e.g. If you updated the constant
range 5 times, drop to overdefined.

I agree, although it might be nice only to limit the number of refinements across backedges (because, if I'm thinking about this correctly, that's the only way to really get any actually-bad behavior). If there are no backedges, then the number of refinements is limited by the number of instructions in any case.

-Hal

From: "Philip Reames via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Daniel Berlin" <dberlin@dberlin.org>, "Joerg Sonnenberger" <joerg@britannica.bec.de>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, January 14, 2016 3:26:02 PM
Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value Propagation

I don't think that arbitrary limiting the complexity of the search
is the
right approach. There are numerous ways the LVI infrastructure
could be
made more memory efficient. Fixing the existing code to be memory
efficient
is the right approach. Only once there's no more low hanging fruit
should
we even consider clamping the search.

Memory efficiency is only half of the problem. I.e. groonga's expr.c
needs 4m to build on my laptop, a 2.7GHz i7. That doesn't sound
reasonable for a -O2. Unlike the GVN issue, the cases I have run into
do
finish after a(n unreasonable) while, so at least it is not trivially
superlinear.

Okay, so rather than artificially limit stuff, we should see if we
can fix the efficiency of the algorithms.

CVP is an O(N*lattice height) pass problem. It sounds like it is
being more than that for you. (Deliberately replying up thread to
skip detailed discussion of the lattice.)

We have had bugs in the past which causes us to move back up the
lattice. The most likely cause of long running time(*) is an
accidental reversal in the lattice. Note that we move up the lattice
intentionally in some cases (specifically around assumes) when
intersecting facts from different sources.

(*) Assuming we're talking about an algorithmic issue and not simply
poor implementation quality. We're definitely more wasteful about
memory than we need to be for instance. Depending on the caching
behavior of the algorithm, this could appear super linear in running
time.

If you have found a case that involves many steps for each variable
in the lattice, one principled fix would be to bound the number of
constant range refinements allowed. e.g. If you updated the constant
range 5 times, drop to overdefined.

I agree, although it might be nice only to limit the number of refinements across backedges (because, if I'm thinking about this correctly, that's the only way to really get any actually-bad behavior). If there are no backedges, then the number of refinements is limited by the number of instructions in any case.

I'd have to double check, but I doubt believe LVI is optimistic about loops at all. I believe it will try to look at each input recursively, special case the case where it encounters the query block while doing so, but otherwise immediately fall to overdefined.

From: "Philip Reames via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Daniel Berlin" <dberlin@dberlin.org>, "Joerg Sonnenberger" <joerg@britannica.bec.de>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Thursday, January 14, 2016 3:26:02 PM
Subject: Re: [llvm-dev] High memory use and LVI/Correlated Value Propagation

I don't think that arbitrary limiting the complexity of the search
is the
right approach. There are numerous ways the LVI infrastructure
could be
made more memory efficient. Fixing the existing code to be memory
efficient
is the right approach. Only once there's no more low hanging fruit
should
we even consider clamping the search.

Memory efficiency is only half of the problem. I.e. groonga's expr.c
needs 4m to build on my laptop, a 2.7GHz i7. That doesn't sound
reasonable for a -O2. Unlike the GVN issue, the cases I have run into
do
finish after a(n unreasonable) while, so at least it is not trivially
superlinear.

Okay, so rather than artificially limit stuff, we should see if we
can fix the efficiency of the algorithms.

CVP is an O(N*lattice height) pass problem. It sounds like it is
being more than that for you. (Deliberately replying up thread to
skip detailed discussion of the lattice.)

We have had bugs in the past which causes us to move back up the
lattice. The most likely cause of long running time(*) is an
accidental reversal in the lattice. Note that we move up the lattice
intentionally in some cases (specifically around assumes) when
intersecting facts from different sources.

(*) Assuming we're talking about an algorithmic issue and not simply
poor implementation quality. We're definitely more wasteful about
memory than we need to be for instance. Depending on the caching
behavior of the algorithm, this could appear super linear in running
time.

If you have found a case that involves many steps for each variable
in the lattice, one principled fix would be to bound the number of
constant range refinements allowed. e.g. If you updated the constant
range 5 times, drop to overdefined.

I agree, although it might be nice only to limit the number of refinements across backedges (because, if I'm thinking about this correctly, that's the only way to really get any actually-bad behavior). If there are no backedges, then the number of refinements is limited by the number of instructions in any case.

I'd have to double check, but I don't believe LVI is optimistic about loops at all. I believe it will try to look at each input recursively, special-case the situation where it encounters the query block while doing so, but otherwise immediately fall to overdefined.

Done. With the recent changes, it looks like three of the five cases are
now only barely failing, while one of them is still past the 2GB barrier
by a lot.

Joerg