Memory utilization problems in profile reader

I’ve been experimenting with profiled bootstraps using sample profiles. Initially, I made stage2 build stage3 while running under Perf. This produced a 20GB profile, which took too long to convert to LLVM’s profile format and used ~30GB of RAM. So, I decided that this was not going to be very useful for general usage.

I then changed the bootstrap to instead run each individual compile under Perf. This produced ~2,200 profiles, each of which took up to a minute to convert, and then they all had to be merged into a single profile. I didn’t like that either.

Since all compiles are more or less the same in terms of what the compiler does, I decided to take the top 10 biggest profiles and merge those. That seemed to work. It resulted in a 21MB profile that I could use as input to -fprofile-sample-use.

I started stage 3 of the bootstrap and left it running. I noticed it was slow, so I thought “we’ll need to speed things up”. The build never finished. Instead, ninja crashed my machine.

It turns out that each clang invocation was growing to 4GB of RSS. All that memory is being allocated by the profile reader (heap profile: https://drive.google.com/file/d/0B9lq1VKvmXKFQVp1cGtZM2RSdWc/view?usp=sharing).

So, heads up: we need to trim it down. Perhaps by loading one function profile at a time, using it, and actively discarding it. Or simply by being better at flushing the reader’s data structures as they’re consumed during annotation. I’ll be sending patches about this in the coming days.
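The load-use-discard scheme sketched above could look roughly like this. This is a minimal sketch, and every name in it (FunctionProfile, streamProfiles, the callbacks) is hypothetical, not the actual SampleProfileReader API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical shape of one function's sample profile.
struct FunctionProfile {
  std::string Name;
  std::vector<std::uint64_t> Samples;
};

// Instead of materializing every function's profile before annotation
// starts, pull records through one at a time so peak memory stays at
// O(largest single function) rather than O(whole profile).
inline std::size_t streamProfiles(
    std::function<std::optional<FunctionProfile>()> readNext,
    std::function<void(const FunctionProfile &)> annotate) {
  std::size_t NumAnnotated = 0;
  while (std::optional<FunctionProfile> P = readNext()) {
    annotate(*P);   // use it...
    ++NumAnnotated; // ...then *P is destroyed at the end of this iteration
  }
  return NumAnnotated;
}
```

The point of the shape is that nothing outside the loop body ever holds a reference to a record, so each one can be freed before the next is read.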

It’s likely that the sample reader is doing something silly here. Duncan, Justin, do you recall issues like this one with the instrumentation reader? I’ll be trying a similar experiment with it after I’m done with the biggest issues in the sampler.

Thanks. Diego.

Can you extract the relevant part of the heap profile data? How large is the sample profile data fed to the compiler?

The indexed format profile size for clang is <100MB. The InstrProfRecord for each function is read, used, and discarded one at a time, so there should not be a problem as described.

David

Can you extract the relevant part of the heap profile data?

It's all profile data, actually. The heap utilization is massively
dominated by the profile reader.

  How large is the sample profile data fed to the compiler?

For this run, the input file was 21MB.

The indexed format profile size for clang is <100MB. The InstrProfRecord
for each function is read, used, and discarded one at a time, so there
should not be a problem as described.

Good.

So, I traced it down to the DenseMaps in class FunctionSamples. I’ve replaced them with two std::vectors, and reading the profile now grows the compiler from 70MB to 280MB. With the DenseMaps, reading the profile grows the compiler from 70MB to 3GB.

Somehow the DenseMaps are causing a 10x growth factor. The large key/value pairs they store inline are probably the issue. Or perhaps we just need a different representation for sample records and call sites.
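A rough back-of-the-envelope model shows where a factor like that can come from. The assumptions here (open addressing over a power-of-two bucket array, load factor kept under 3/4, a 64-bucket minimum once the map is non-empty) match DenseMap's historical behavior, but the exact constants are mine, not measured from LLVM:

```cpp
#include <cstddef>

// Rough model of a DenseMap-style bucket-array footprint. Every bucket,
// occupied or not, pays sizeof(std::pair<KeyT, ValueT>) inline.
// Assumptions: power-of-two capacity, load factor <= 3/4, and a
// 64-bucket minimum for a non-empty map.
constexpr std::size_t denseMapBytes(std::size_t NumEntries,
                                    std::size_t PairSize) {
  std::size_t Buckets = 64;
  while (Buckets * 3 < NumEntries * 4) // grow while load would exceed 3/4
    Buckets *= 2;
  return Buckets * PairSize;
}

// A FunctionSamples-style map with a handful of entries and a large
// inline value type wastes most of its allocation on empty buckets:
static_assert(denseMapBytes(4, 128) == 64 * 128, // 8KB held for ~512B of data
              "small map, large pairs: ~16x overhead");
```

With thousands of small per-function maps, each padded out to the minimum bucket count, that per-map overhead multiplies into exactly the kind of blowup described above.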

Yes. In going through DenseMap's implementation I see that large key/value
types will cause a lot of growth. And the documentation confirms it
(someday I'll learn to read the documentation first).

I agree, it's a pretty major pitfall. It'd be nice if DenseMap used
std::unique_ptr or something under the hood for value types over a certain
size.
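The boxing idea suggested above can be made concrete with a sizeof comparison. This is a sketch, not the DenseMap API, and BigValue is a made-up stand-in for a large profile record:

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Stand-in for a large value type such as a profile record.
struct BigValue {
  std::uint64_t Counts[32]; // 256 bytes
};

// What every bucket pays when the value is stored inline:
using InlinePair = std::pair<unsigned, BigValue>;
// Boxing the value behind a pointer shrinks the per-bucket cost to the
// key plus one pointer, at the cost of an extra indirection on lookup:
using BoxedPair = std::pair<unsigned, std::unique_ptr<BigValue>>;

static_assert(sizeof(BoxedPair) < sizeof(InlinePair),
              "boxing shrinks the per-bucket footprint");
```

Since empty buckets dominate in a sparsely filled table, shrinking the per-bucket pair size is what actually recovers the wasted memory.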

Isn’t that ironic as ‘Dense’ should mean small and compact? Can that be fixed?

David

So, I traced it down to the DenseMaps in class FunctionSamples. I've
replaced them with two std::vectors, and reading the profile now grows
the compiler from 70MB to 280MB. With the DenseMaps, reading the
profile grows the compiler from 70MB to 3GB.

The growth from 70MB to 280MB for a ~20MB profile is also alarming, IMO.

David

I think, at the time it was written, the goal was to improve on std::map
for small data types, so it is "dense" in the sense that it does not
fragment the heap.

We also have this naming problem with SmallVector and TinyPtrVector. Both
data types claim to use less memory, but they are optimized in different
directions: TinyPtrVector minimizes object size so that it can be used
efficiently in a DenseMap, while SmallVector increases the object
footprint to avoid heap allocations in favor of inline (stack) storage.
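To make the two directions concrete, here are toy models of the layouts. These are simplified sketches of the idea only, not the LLVM classes:

```cpp
#include <cstddef>

// SmallVector-like: carries N inline slots, so the object is big but
// sizes up to N never touch the heap.
template <typename T, unsigned N>
struct InlineVec {
  T Storage[N];
  std::size_t Size = 0;
  T *Data = Storage; // would point at a heap buffer past N (not shown)
};

// TinyPtrVector-like: one pointer wide; the pointer is either a single
// element directly or a pointer to an overflow vector.
template <typename T>
struct PtrOrVec {
  void *Val = nullptr;
};

static_assert(sizeof(PtrOrVec<int *>) == sizeof(void *),
              "stays pointer-sized, cheap to embed as a DenseMap value");
static_assert(sizeof(InlineVec<int *, 4>) > 4 * sizeof(int *),
              "pays for its inline slots even when empty");
```

The same "optimized for what?" question applies to DenseMap: it is dense in the sense of avoiding heap fragmentation and pointer chasing, not in the sense of minimal bytes per entry.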

Yes, but that’s a second order issue. I’ll deal with that after this one’s settled.

Can you extract the relevant part of the heap profile data? How large is
the sample profile data fed to the compiler?

The indexed format profile size for clang is <100MB. The InstrProfRecord
for each function is read, used and discarded one at a time, so there
should not be problem as described.

If I'm reading the code right, we are also doing O(number of keys in the
hash table) memory allocation in the indexed reader here:
http://llvm.org/docs/doxygen/html/classllvm_1_1InstrProfReaderIndex.html#acc49fd2c0a8c8dfc3e29b01e09869af7
That seems unnecessary. (It seems to be used for value profiling stuff for
some reason?)

-- Sean Silva

Can you extract the relevant part of the heap profile data? How large
is the sample profile data fed to the compiler?

The indexed format profile size for clang is <100MB. The InstrProfRecord
for each function is read, used, and discarded one at a time, so there
should not be a problem as described.

If I'm reading the code right, we are also doing O(number of keys in the
hash table) memory allocation in the indexed reader here:
http://llvm.org/docs/doxygen/html/classllvm_1_1InstrProfReaderIndex.html#acc49fd2c0a8c8dfc3e29b01e09869af7
That seems unnecessary. (It seems to be used for value profiling stuff for
some reason?)

It is for value profiling: it is used to convert the on-disk callee target
value (an MD5 hash) into a unique string pointer when the function record's
VP data is read from memory. I will check its memory overhead at some
point. The translation is not strictly needed, as a matter of fact; I've
wanted to get rid of it but have not found the time to do it yet (it is on
my TODO list).

David
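The eager MD5-to-name translation described above can be sketched like this. The names here are hypothetical, not the actual InstrProf API; the point is only that the table is proportional to the number of functions in the profile, which is the O(#keys) allocation under discussion:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Maps the on-disk MD5 hash of a callee's name back to the name itself,
// so value-profile callee targets can be resolved while reading records.
// Built eagerly over every function in the profile, its footprint is
// O(number of functions); building it lazily, or leaving targets as raw
// MD5 values, would avoid the up-front allocation.
class CalleeNameTable {
  std::unordered_map<std::uint64_t, std::string> HashToName;

public:
  void add(std::uint64_t MD5, std::string Name) {
    HashToName.emplace(MD5, std::move(Name));
  }

  // Returns the resolved name, or nullptr if the hash is unknown.
  const std::string *lookup(std::uint64_t MD5) const {
    auto It = HashToName.find(MD5);
    return It == HashToName.end() ? nullptr : &It->second;
  }
};
```
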

Can you extract the relevant part of the heap profile data? How large
is the sample profile data fed to the compiler?

The indexed format profile size for clang is <100MB. The
InstrProfRecord for each function is read, used, and discarded one at a
time, so there should not be a problem as described.

If I'm reading the code right, we are also doing O(number of keys in
the hash table) memory allocation in the indexed reader here:
http://llvm.org/docs/doxygen/html/classllvm_1_1InstrProfReaderIndex.html#acc49fd2c0a8c8dfc3e29b01e09869af7
That seems unnecessary. (It seems to be used for value profiling stuff
for some reason?)

It is for value profiling: it is used to convert the on-disk callee target
value (an MD5 hash) into a unique string pointer when the function record's
VP data is read from memory. I will check its memory overhead at some
point. The translation is not strictly needed, as a matter of fact; I've
wanted to get rid of it but have not found the time to do it yet (it is on
my TODO list).

Thanks. Good to know it is on your radar.

-- Sean Silva