RFC: DenseMap grow() slowness

There are a few passes in LLVM that make heavy use of a big DenseMap, one that potentially gets filled with up to one entry for each instruction in the function. EarlyCSE is the best example, but Reassociate and MachineCSE have this to some degree as well (there might be others?). To put it simply: at least in my profile, EarlyCSE spends ~1/5 of its time growing DenseMaps. This is kind of… bad.

grow() is inherently slow because it needs to rehash and reinsert everything. This means growing a DenseMap costs much, much more than growing, for example, a vector. I talked about this with a few people and here are some possibilities we’ve come up with to improve this (some of which probably aren’t what we want):

  1. Use a map that doesn’t require rehashing and reinsertion to grow. Chaining lets you do this, but std::unordered_map is probably so much slower than DenseMap we’d lose more than we gain.
  2. Include the hash code in the map so that we don’t have to rehash. 32 bits more per entry (or whatever), and it might not help that much, since we still have to do the whole reinsertion routine. (There’s a rough sketch of this after the example below.)
  3. Pre-calculate an estimate of the map size we need. For example, in EarlyCSE, this is a possibly gross overestimate of the size needed:

// Rough pre-sizing: one pass over the function to estimate how many
// entries each CSE map might need.
unsigned InstCount = 0;
unsigned LoadCount = 0;
unsigned CallCount = 0;
for (inst_iterator FI = inst_begin(F), FE = inst_end(F); FI != FE; ++FI) {
  if (FI->mayReadOrWriteMemory())
    ++LoadCount;
  else if (isa<CallInst>(*FI))
    ++CallCount;
  else
    ++InstCount;
}
AvailableValues.resize(InstCount);
AvailableLoads.resize(LoadCount);
AvailableCalls.resize(CallCount);

But it does the job, and saves ~20% of time in EarlyCSE on my profiles. Yes, iterating over the entire function is way cheaper than grow(). The downside is that while the estimate is still bounded by function size, it could end up allocating a good bit more than needed depending on (in EarlyCSE’s case) the control flow/dominator structure.
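Coming back to option 2, a rough sketch of what caching the hash might look like; the bucket layout and names here are hypothetical, not DenseMap’s actual internals:

// Hypothetical bucket that caches the key's hash at insertion time.
template <typename KeyT, typename ValueT>
struct CachedHashBucket {
  unsigned Hash; // DenseMapInfo<KeyT>::getHashValue(Key), computed once
  KeyT Key;
  ValueT Value;
};

// On grow(), reinsertion can index with the cached hash directly:
//   unsigned NewIdx = B.Hash & (NewNumBuckets - 1);
// instead of recomputing the hash, though every entry still has to be
// moved into the new table.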

Any thoughts on this, or other less ugly alternatives? I estimate that, at least in our pass pipeline, we’re losing at least ~1% of total time to avoidable DenseMap::grow() operations, which feels a little bit… unnecessary.

—escha

From: "via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Tuesday, March 15, 2016 5:07:29 PM
Subject: [llvm-dev] RFC: DenseMap grow() slowness

There’s a few passes in LLVM that make heavy use of a big DenseMap,
one that potentially gets filled with up to 1 entry for each
instruction in the function. EarlyCSE is the best example, but
Reassociate and MachineCSE have this to some degree as well (there
might be others?). To put it simply: at least in my profile,
EarlyCSE spends ~1/5 of its time growing DenseMaps. This is kind of…
bad.

grow() is inherently slow because it needs to rehash and reinsert
everything. This means growing a DenseMap costs much, much more than
growing, for example, a vector. I talked about this with a few
people and here are some possibilities we’ve come up with to improve
this (some of which probably aren’t what we want):

1. Use a map that doesn’t require rehashing and reinsertion to grow.
Chaining lets you do this, but std::unordered_map is probably so
much slower than DenseMap we’d lose more than we gain.
2. Include the hash code in the map so that we don’t have to rehash.
32 bits more per entry (or whatever), and it might not help that
much, since we still have to do the whole reinsertion routine.
3. Pre-calculate an estimate as to the map size we need. For example,
in EarlyCSE, this is possibly gross overestimate of size needed:

unsigned InstCount = 0 ;
unsigned LoadCount = 0 ;
unsigned CallCount = 0 ;
for ( inst_iterator FI = inst_begin ( F ), FE = inst_end ( F ); FI !=
FE; ++FI) {
if (FI-> mayReadOrWriteMemory ())
++LoadCount;
else if ( isa < CallInst >(*FI))
++CallCount;
else
++InstCount;
}
AvailableValues . resize (InstCount);
AvailableLoads . resize (LoadCount);
AvailableCalls . resize (CallCount);

But it does the job, and saves ~20% of time in EarlyCSE on my
profiles. Yes, iterating over the entire function is way cheaper
than grow(). Downsides are that while it’s still bounded by function
size, it could end up allocating a good bit more depending on — in
EarlyCSE’s case — the control flow/dominator structure.

This last option makes perfect sense to me. One thing we might be able to do regarding the extra memory overhead is, instead of actually resizing up front, to start with a relatively small map, but use the function size to set the growth factor so that we grow only a small number of times (say once) in the worst case.

-Hal

Growing fewer times doesn’t actually help much from what I can tell, because the largest grow() always dominates the cost of the others.

For example, if you start with 8 buckets, and you end up needing 512 buckets, and you grow 2x at a time:

8, 16, 32, 64, 128, 256, 512

The final grow() costs as much as all the others combined. So if you grow 4x at a time:

8, 32, 128, 512

you don’t actually save much; I think the gain is probably bounded at a factor of 2.
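To put numbers on that (assuming grow() cost roughly proportional to the new bucket count):

2x steps: 8 + 16 + 32 + 64 + 128 + 256 + 512 = 1016 (~2.0x the final 512)
4x steps: 8 + 32 + 128 + 512 = 680 (~1.3x the final 512)
one jump straight to 512: 512 (1.0x)

Since the doubling series sums to less than twice the final grow(), even jumping directly to the final size can only beat plain doubling by a factor of 2 or so.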

—escha


Understood; but that wasn’t quite what I had in mind. What if we added a threshold and a target size (set as above), so that the growth sequence would look like this:

8, 16, 32, 64, 128, ..., threshold (I suppose this will be big), target size

i.e., once you hit some threshold, we assume the map will be large and jump right to your precalculated target size. Would that bring the benefits while mitigating much of the extra memory overhead?
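A minimal sketch of that policy (the function name and the exact rule are made up for illustration):

// Double while the table is small; once it crosses Threshold, jump
// straight to the bucket count precomputed from the function size,
// as in the EarlyCSE estimate above.
unsigned nextBucketCount(unsigned CurBuckets, unsigned Threshold,
                         unsigned TargetBuckets) {
  if (CurBuckets >= Threshold)
    return TargetBuckets; // one big grow instead of several
  return CurBuckets * 2;  // normal doubling for small maps
}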

-Hal

Yes, it makes sense. Avoid using DenseMap when the size of the map is expected to be large but cannot be pre-determined.

David

What should we use instead of DenseMap?

—escha

In the EarlyCSE case, the size of the DenseMap can be determined ahead of time, so it is fine to use (assuming the iteration overhead is low). There are other factors to consider when using DenseMap. It removes one level of indirection by making the buckets an array of key/value pairs (unlike StringMap, where the buckets are an array of pointers to key/value pairs). When the value type is large and expensive to construct, the overhead of rehashing a DenseMap becomes even higher. The memory overhead is also larger, since a sparse map has more large holes.
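Schematically, the two layouts being contrasted look something like this (heavily simplified; the real ADTs carry more machinery):

#include <utility>
#include "llvm/ADT/StringMap.h"

// DenseMap-style: key/value pairs stored inline in the bucket array.
// Lookups avoid an indirection, but grow() moves whole pairs, and
// every empty bucket wastes sizeof(std::pair<KeyT, ValueT>) bytes.
template <typename KeyT, typename ValueT>
struct DenseMapStyleTable {
  std::pair<KeyT, ValueT> *Buckets;
};

// StringMap-style: the table holds pointers to separately allocated
// entries, so grow() only moves pointers and empty slots cost one
// pointer each, at the price of an extra indirection per lookup.
template <typename ValueT>
struct StringMapStyleTable {
  llvm::StringMapEntry<ValueT> **TheTable;
};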

I don’t have a good answer as to the right alternative (std::unordered_map, std::map, etc.); it depends on many factors, including insertion and query patterns, value type, average table size, and worst-case scenarios. It is case by case, so doing some experiments with performance data would be a helpful exercise.

thanks,

david

Only an upper bound; the actual max size is the number of CSE-able instructions “live” in scope at any one time (I think), so at least in theory it could be a gross overestimate.

—escha

Perhaps some heuristics can be applied to reduce the estimated size, assuming there is some correlation on average, I guess.

David

Slightly OT, but it looks like LoadValue (the value type of the AvailableLoads structure) is relatively memory-inefficient. At a minimum, we could get rid of the IsAtomic field by using a tagged pointer. That would at least bring us down to 128 bits (a nice power of two), which might help speed up some of the copying.

p.s. Good find!

Philip
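For reference, the usual LLVM way to do that is PointerIntPair, which stows the flag in the pointer’s low bits; the struct below is a guess at the shape, not the actual LoadValue definition:

#include "llvm/ADT/PointerIntPair.h"
#include "llvm/IR/Instruction.h"

// Hypothetical compacted LoadValue: IsAtomic rides in the low bit of
// the instruction pointer instead of occupying its own field.
struct CompactLoadValue {
  llvm::PointerIntPair<llvm::Instruction *, 1, bool> DataAndIsAtomic;
  unsigned Generation; // stand-ins for the remaining fields
  int MatchingId;

  bool isAtomic() const { return DataAndIsAtomic.getInt(); }
};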


Another choice is to implement DenseMap using a segmented array (of buckets). The segment size is fixed, so finding a bucket from its bucket number is pretty cheap: segment number + segment offset.

With this, there is no need for resizing.
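A minimal sketch of that addressing scheme (the segment size and container choices here are arbitrary, just for illustration):

#include <memory>
#include <vector>

// Buckets live in fixed-size segments; growing appends a new segment
// instead of reallocating and copying the whole table.
constexpr unsigned SegmentSize = 64; // power of two, so the div/mod
                                     // below compile to shift/mask

template <typename BucketT>
struct SegmentedBuckets {
  std::vector<std::unique_ptr<BucketT[]>> Segments;

  BucketT &bucket(unsigned Index) {
    return Segments[Index / SegmentSize][Index % SegmentSize];
  }

  void addSegment() { // existing buckets never move
    Segments.push_back(std::make_unique<BucketT[]>(SegmentSize));
  }
};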

David

Just to note, it looks like I was testing on a somewhat older LLVM version that didn’t have LoadValue at all, so my guess is this means it’s even -worse- in trunk.

—escha

Er, LoadValue’s been around for a while (6 months). How far back are you testing? I’d strongly suggest switching to something more recent. Philip

Some of us have to support internal release branches for non-trivial amounts of time.

—Owen

Right, but how does that relate to the code base you’re looking at when analyzing performance and proposing changes on ToT?

We try very hard to upstream all improvements we make to LLVM, even if we find them while investigating or debugging older release branches. We wouldn’t be proposing it if we did not believe it would be of benefit to ToT as well.

—Owen