Possible Memory Savings for tools emitting large amounts of existing data through MC

Just in case it interests anyone else, I’m playing around with trying to broaden the MCStreamer API to allow for emission of bytes without copying the contents into a local buffer first (either because you already have a buffer, or the bytes are already present in another file, etc) in http://reviews.llvm.org/D17694 . In theory there’s some overlap with lld here (no doubt it already does this sort of thing, but not in a way, I assume, we could reuse from other tools at the moment) and my motivation, llvm-dwp, looks very much like “linking with a few extra steps”.

But to check that these changes might be more generally applicable, I thought I’d solicit data from anyone building tools that might be memory constrained as well.

First that comes to mind (Eric suggested/mentioned) is llvm-dsymutil.

Adrian/Fred - do you guys ever have trouble with memory usage of llvm-dsymutil? Do you have an example you could provide that has high memory usage, so I could see if any simple changes based on my prototype MC changes would help?

A quick glance at dsymutil’s code indicates it might benefit slightly, at least in the string table emission, for example. It looks very similar to string table emission in dwp: just being able to reference the strings in the StringMap rather than copying them into MCStreamer could help. (I also found that using a DenseMap keyed on StringRefs into the memory mapped input helped as well, but that’s a change you can make locally without any MCStreamer improvements.) Other parts might be trickier, and consist of parts of referenceable data (like the line table header) and parts that are not referenceable (like their contents); my prototype could be extended to handle that.
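
A rough sketch of the reference-instead-of-copy idea for a string table, in standard C++ rather than LLVM's ADTs: std::string_view stands in for StringRef and std::unordered_map for StringMap/DenseMap. The StringTable class and its members are hypothetical, not code from dsymutil or llvm-dwp, and the views are assumed to point into storage (such as a memory mapped input) that outlives the table:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical string-table builder that stores views, never copies.
class StringTable {
public:
  // Returns the offset of Str in the output table, adding it if unseen.
  // Str must point into storage that outlives this object.
  std::uint64_t add(std::string_view Str) {
    assert(Str.find('\0') == std::string_view::npos &&
           "entries are NUL-terminated, so they cannot contain NUL");
    auto [It, Inserted] = Offsets.try_emplace(Str, NextOffset);
    if (Inserted) {
      Order.push_back(Str);
      NextOffset += Str.size() + 1; // +1 for the terminating NUL
    }
    return It->second;
  }

  // Total size in bytes of the table once emitted.
  std::uint64_t size() const { return NextOffset; }

  // Unique strings in emission order; each view still points at the input.
  const std::vector<std::string_view> &strings() const { return Order; }

private:
  std::unordered_map<std::string_view, std::uint64_t> Offsets;
  std::vector<std::string_view> Order;
  std::uint64_t NextOffset = 0;
};
```

The only per-string cost here is the view (pointer plus length) and the map entry; the characters themselves stay in the input buffer until they are written out.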

Hi David,

The way I imagined that we might want to extend the MCStreamer API (this
was motivated by DIEData) is by allowing clients to move bytes and fixups
into the MC layer.

This is the sort of API that I was imagining:

void MoveAndEmitFragment(SmallVectorImpl<char> &&Data,
                         SmallVectorImpl<MCFixup> &&Fixups);

Note that this mirrors the fields
MCEncodedFragmentWithContents::Contents and MCEncodedFragmentWithFixups::Fixups
and the arguments could be directly moved into the fields of a newly created
MCDataFragment.
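
To make the move semantics concrete, here is a minimal standalone sketch. It uses std::vector in place of SmallVectorImpl, and Fixup, DataFragment and Streamer are simplified, hypothetical stand-ins rather than the real MC classes:

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <utility>
#include <vector>

// Hypothetical stand-in; the real MCFixup carries more state.
struct Fixup {
  std::uint32_t Offset;
  std::uint32_t Kind;
};

// Hypothetical stand-in for a data fragment with contents and fixups.
struct DataFragment {
  std::vector<char> Contents;
  std::vector<Fixup> Fixups;
};

class Streamer {
public:
  // Takes ownership of the caller's buffers. Moving a std::vector hands over
  // its heap allocation, so no bytes are copied.
  void moveAndEmitFragment(std::vector<char> &&Data,
                           std::vector<Fixup> &&Fixups) {
    for (const Fixup &F : Fixups) {
      assert(F.Offset < Data.size() && "fixup must point inside the fragment");
      (void)F; // silences the unused-variable warning when asserts are off
    }
    Fragments.push_back({std::move(Data), std::move(Fixups)});
  }

  const std::list<DataFragment> &fragments() const { return Fragments; }

private:
  std::list<DataFragment> Fragments;
};
```

With std::vector the move transfers the heap buffer, so the byte pointer is unchanged afterwards. A SmallVector that is still using its inline storage has its elements copied on move, so the saving would only apply once the data lives on the heap.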

Would that work for your use case?

Peter

Since dsymutil processes object files one after another, memory usage wasn’t really a problem so far, but you could try running llvm-dsymutil on bin/clang for a larger example (takes about a minute to finish).

– adrian

> Since dsymutil processes object files one after another,

As does llvm-dwp. Think of llvm-dwp more like a linker with a few extra
bits. But the MCStreamer API means any bytes you write to the streamer stay
in memory until you "Finish" - so if you're dwp/linking large enough
inputs, you have them all in memory when you really don't need them. For
example, the dwp file I was generating is 7GB, but the tool with the memory
improvements only has a high water mark of 2.3GB.

> memory usage wasn’t really a problem so far, but you could try running
> llvm-dsymutil on bin/clang for a larger example (takes about a minute to
> finish).

Was thinking of something more accessible to me, on a non-Darwin platform.
Is there a way I can generate the dsym inputs across Clang on a non-Darwin
platform? (what happens if I run dsymutil on my ELF object files?)

> Would that work for your use case?

Not quite, unfortunately - the issue is that we're doing a task that is
essentially "linking + a bit" - so imagine linking a bunch of files
together, the final, say, debug_info.dwo section is made up of the
concatenation of all the debug_info.dwo sections of the inputs. So it's
fragmented and it's already available, memory mapped, never in a
SmallVector, etc.

At this point probably nothing. Dsymutil acts on STABS symbol table entries that are (I guess) not present in a typical ELF binary. Dsymutil also only implements MachO relocations and has lots of other things where the ELF implementation is missing. It’s probably not too much work to wire all this up, but so far nobody did it.

– adrian

& no easy way for me to get a representative (or pathologically large,
even) set of Mach-O files to play with, I take it? It's no worries - just
figured I'd give it a go if it was convenient.

I can definitely go and grab you a clang build directory from one of the green dragon bots for example; but all the paths are hardcoded so you’d have to install them in the exact same location. In theory everything doing file access should be handled by the LLVM low-level libraries, so this could work.

– adrian

If you like/have time, feel free to throw them up somewhere I can download
them from.

I see. I guess there's a couple of ways you can go with llvm-dwp:
1) Extend MC with optional ownership as you are doing in your patch.
2) Modify llvm-dwp to write object files directly.

2 is what lld does (with the help of libObject) and might not be such a bad
choice, but it would be adding a lot of machinery for a very specific task
that MC already needs to know how to do in a roughly target-independent way,
so maybe it would be overkill.

I reckon that in most cases MC clients aren't going to be copying large
amounts of unowned data, they're most likely going to be creating that
data. So perhaps the implementation should reflect that somehow.

Specifically what I had in mind was that you could add some other derived class
of MCFragment that would store a StringRef (or maybe a vector of StringRefs
if that proves useful), and that would be unrelated to MCDataFragment.
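
Roughly, I am picturing something like the following. This is only a sketch in standard C++: std::string_view stands in for StringRef, and Fragment, OwnedFragment, ReferencedFragment and render are hypothetical names, not the real MCFragment hierarchy. The referenced bytes are assumed to outlive the fragment:

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical base class: every fragment knows its size and can write itself.
class Fragment {
public:
  virtual ~Fragment() = default;
  virtual std::size_t size() const = 0;
  virtual void writeTo(std::ostream &OS) const = 0;
};

// Owns its bytes, like a fragment filled in by an emitter.
class OwnedFragment : public Fragment {
public:
  explicit OwnedFragment(std::vector<char> Data) : Bytes(std::move(Data)) {}
  std::size_t size() const override { return Bytes.size(); }
  void writeTo(std::ostream &OS) const override {
    OS.write(Bytes.data(), static_cast<std::streamsize>(Bytes.size()));
  }

private:
  std::vector<char> Bytes;
};

// Only references bytes that live elsewhere, e.g. sections of mapped inputs.
class ReferencedFragment : public Fragment {
public:
  void append(std::string_view Piece) { Pieces.push_back(Piece); }
  std::size_t size() const override {
    std::size_t Total = 0;
    for (std::string_view P : Pieces)
      Total += P.size();
    return Total;
  }
  void writeTo(std::ostream &OS) const override {
    for (std::string_view P : Pieces)
      OS.write(P.data(), static_cast<std::streamsize>(P.size()));
  }

private:
  std::vector<std::string_view> Pieces;
};

// Renders a fragment to a string and checks size() against the bytes written.
std::string render(const Fragment &F) {
  std::ostringstream OS;
  F.writeTo(OS);
  std::string Out = OS.str();
  assert(Out.size() == F.size() && "size() must match the bytes written");
  return Out;
}
```

Keeping the referencing fragment separate means the owning fragment does not have to grow an ownership flag; the cost is one more fragment kind that layout and the writer have to understand.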

WDYT?

Thanks,

Yes, I agree dsymutil could see slight gains from the parts you mention. It wouldn’t be groundbreaking though, the biggest contributor would always be the DIE tree that we recreate completely.

Fred

> > Since dsymutil processes object files one after another,
>
> As does llvm-dwp. Think of llvm-dwp more like a linker with a few extra bits. But the MCStreamer API means any bytes you write to the streamer stay in memory until you “Finish” - so if you’re dwp/linking large enough inputs, you have them all in memory when you really don’t need them. For example, the dwp file I was generating is 7GB, but the tool with the memory improvements only has a high water mark of 2.3GB.

I’m a bit surprised by those numbers. If the output is 7GB, don’t you need to have a high watermark of 7GB at emission time even with your scheme?
Also, in D17694 you mention that the memory peak goes from 9.6GB to 2.3GB. Is this dirty memory or allocated memory? When investigating the memory use of dsymutil, I found out that the exponential growth of the MC vectors would hide the real memory usage (eg showing 2GB when the code actually used just a bit over 1GB).
Just curious, I think your approach makes a lot of sense.

Fred

> I see. I guess there's a couple of ways you can go with llvm-dwp:
> 1) Extend MC with optional ownership as you are doing in your patch.
> 2) Modify llvm-dwp to write object files directly.
>
> 2 is what lld does (with the help of libObject) and might not be such a bad
> choice, but it would be adding a lot of machinery for a very specific task
> that MC already needs to know how to do in a roughly target-independent way,
> so maybe it would be overkill.

Yeah, if lld's code for doing this were more reusable that might be an
option, but I assume it isn't. (alternatively, could move dwp tool to be a
subproject of lld itself, a "dwp" driver that would just enable the special
treatment of cu/tu_index sections) At least for my needs, the modifications
to MC seem sufficiently unobtrusive & potentially generally useful
(eventually LLVM might care about the memory impact of MC - perhaps for
especially weird inputs (large amounts of static data, for example)).

> I reckon that in most cases MC clients aren't going to be copying large
> amounts of unowned data, they're most likely going to be creating that
> data. So perhaps the implementation should reflect that somehow.

Yeah, even for that, though - they may not want to keep it all in memory.
For example one of the next largest memory costs in dwp is the str_offsets
section, where we emit a bunch of ints created by processing the input. I
know how big the output will be, but I don't want/need to allocate a vector
of all of them, if I could stream them out instead.

So in theory we could generalize more aggressively, rather than narrow down
the usage - if I could pass a thing that could be queried for size and
could write bytes to the underlying entity I could save that memory too. So
could LLVM - for example, type units wouldn't need to all be stored in
bytes before writing any part of them out, we could stream it out to disk.
(there's some buffering in Clang that adds another layer to get through -
so it's actually buffered twice, the MC changes only remove one layer, we'd
have to change clang (& change MC to not require pwrite to patch the
header) to avoid the buffering entirely)
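
As a sketch of that "size plus writer" shape (all names here are hypothetical, nothing like this exists in MC today): the fragment records how many bytes it will produce and a callback that produces them on demand, so the offsets never sit in a vector. The offset values are made up for illustration and are written in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical fragment that knows its size up front and generates its bytes
// only when asked to write them.
class StreamedFragment {
public:
  using Writer = std::function<void(std::ostream &)>;

  StreamedFragment(std::size_t Bytes, Writer Fn)
      : Size(Bytes), Write(std::move(Fn)) {}

  std::size_t size() const { return Size; }
  void writeTo(std::ostream &OS) const { Write(OS); }

private:
  std::size_t Size;
  Writer Write;
};

// Example producer: Count 32-bit values, each equal to Base + 4 * index.
// A real writer would pick an explicit endianness.
StreamedFragment makeOffsets(std::uint32_t Base, std::size_t Count) {
  return StreamedFragment(
      Count * sizeof(std::uint32_t), [=](std::ostream &OS) {
        for (std::size_t I = 0; I != Count; ++I) {
          std::uint32_t Value = Base + static_cast<std::uint32_t>(4 * I);
          OS.write(reinterpret_cast<const char *>(&Value), sizeof(Value));
        }
      });
}

// Renders a fragment to a string and checks the promised size was honoured.
std::string render(const StreamedFragment &F) {
  std::ostringstream OS;
  F.writeTo(OS);
  std::string Out = OS.str();
  assert(Out.size() == F.size() && "writer must produce exactly size() bytes");
  return Out;
}
```

Layout only needs size(), so section offsets can still be computed before any payload bytes exist; the writer runs once, at the point the bytes go to the output.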

> Specifically what I had in mind was that you could add some other derived class
> of MCFragment that would store a StringRef (or maybe a vector of StringRefs
> if that proves useful), and that would be unrelated to MCDataFragment.

(needs multiple StringRefs - the output section consists of the
concatenation of all the input sections - so in the simple case it's a
StringRef from each input)

But yeah, could possibly narrow down the usage. I haven't looked closely at
how the MCFragments are created/used/manipulated (some of this conversation
may be better placed in the Differential review thread, perhaps - I do
really appreciate your perspective)

> I’m a bit surprised by those numbers. If the output is 7GB, don’t you
> need to have a high watermark of 7GB at emission time even with your scheme?

Nope, which is the great thing - the input files are memory mapped (reading
with libObject) and by delaying the output a bit more, we can literally be
reading bytes from the memory mapped input and writing them out to the
output file - at no point do we then need to have the entire contents in
memory.

> Also, in D17694 you mention that the memory peak goes from 9.6GB to 2.3GB.
> Is this dirty memory or allocated memory?

Allocated - I used valgrind's --tool=massif to analyze the memory usage.

> When investigating the memory use of dsymutil, I found out that the
> exponential growth of the MC vectors would hide the real memory usage (eg
> showing 2GB when the code actually used just a bit over 1GB).

True, there could be some allocated but undirtied pages. Not sure if
Valgrind accounts for that.
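
For what it's worth, the gap between allocated and used bytes is easy to see with any geometrically growing vector. A small sketch using std::vector rather than the MC vectors themselves (GrowthStats and appendBytes are made-up names, and the growth factor is implementation-defined, so the exact capacity will vary):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct GrowthStats {
  std::size_t Used;     // bytes actually written
  std::size_t Reserved; // bytes the vector has allocated
};

// Appends Count bytes one at a time and reports size versus capacity.
GrowthStats appendBytes(std::size_t Count) {
  std::vector<char> Buffer;
  for (std::size_t I = 0; I != Count; ++I)
    Buffer.push_back('x');
  assert(Buffer.capacity() >= Buffer.size());
  return {Buffer.size(), Buffer.capacity()};
}
```

With a doubling growth policy, appending one byte past a power of two leaves Reserved at close to twice Used, which matches the "2GB shown, just over 1GB used" pattern.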