How to reduce the footprint of MDNodes? (About the comment you made at BOF LTO)

Hi Manman (and llvmdev),

I filed these two bugs to track the ideas that I was cooking:

http://llvm.org/bugs/show_bug.cgi?id=17891
http://llvm.org/bugs/show_bug.cgi?id=17892

TL;DR: I'm saying we should go from:

  !14 = metadata !{i32 786445, metadata !1, metadata !10, metadata !"y", i32 3, i64 32, i64 32, i64 32, i32 0, metadata !13}
to:
  !14 = metadata !"v12,14,y,3,0,32,32,32"(metadata !1, metadata !13)

I think that this is along the lines of what you're thinking of, but doesn't require doing specific optimizations for constant values (since most of those would be encoded into strings instead of showing up as operands).

This is a pretty big project, but I think the payoff would be a huge constant factor improvement in memory use. We should probably also use a similar approach for profile info.

-Chris

Hi Manman (and llvmdev),

I filed these two bugs to track the ideas that I was cooking:

http://llvm.org/bugs/show_bug.cgi?id=17891
http://llvm.org/bugs/show_bug.cgi?id=17892

TL;DR: I'm saying we should go from:

        !14 = metadata !{i32 786445, metadata !1, metadata !10, metadata !"y", i32 3, i64 32, i64 32, i64 32, i32 0, metadata !13}
to:
        !14 = metadata !"v12,14,y,3,0,32,32,32"(metadata !1, metadata !13)

I think that this is along the lines of what you're thinking of, but doesn't require doing specific optimizations for constant values (since most of those would be encoded into strings instead of showing up as operands).

This is a pretty big project, but I think the payoff would be a huge constant factor improvement in memory use. We should probably also use a similar approach for profile info.

We can take discussion of specific format into the bugs, but a more
compact encoding is definitely a good thing. Thanks for filing the
bugs to keep track.

-eric

So, I like where you're going here, but a few tweaks.

First, there are two things going on here: removing an indirection through
a referenced metadata node and flattening N values into a string inclusion.
Removing the indirection seems obvious strict goodness, my comments are
about the second part.

I'm moderately opposed to just encoding these in a string format. I think
we can do something substantially better both for space, time, and
readability. Fundamentally, there is no reason for the original metadata
node you describe to not *encode* its operands into a dense bit-packed blob
of memory. We can still expose APIs that manipulate them as separate
entities, and have the AsmPrinter and AsmParser map back and forth with
nice human-readable forms. But even a simple varint encoding will be both
smaller and faster than ascii.

Just to be clear, I still want the nice format (much like your proposed
format, but maybe with the numbers outside of the "s) in the textual IR, I
just think we should use a more direct and efficient in-memory encoding
(and in-bitcode encoding if that isn't already suitably dense).

David pointed out that the PR17892 is a microoptimization that may not even be worthwhile. I think that the flattening into a string is the important part.

I guess you could make it work, but would that actually be simpler than what is proposed? If it is denser, how much denser would it have to be to justify the complexity?

Where would the encoding schema be specified?

Note that there are simple things that can be done to make MDNodes more efficient in common cases. The CallbackVH is only necessary when pointing to Value*’s that are not MDNode/MDString, and Constants-other-than-GlobalValue. If we make MDNode detect when it has “all-immortal” operands (like most debug info nodes) then we could just store Value*’s directly. This would be a completely invisible implementation improvement, but would not provide the same level of improvement as the “flatten into strings” approach. The two are quite complementary.

-Chris

I'm moderately opposed to just encoding these in a string format. I think
we can do something substantially better both for space, time, and
readability. Fundamentally, there is no reason for the original metadata
node you describe to not *encode* its operands into a dense bit-packed blob
of memory. We can still expose APIs that manipulate them as separate
entities, and have the AsmPrinter and AsmParser map back and forth with
nice human-readable forms. But even a simple varint encoding will be both
smaller and faster than ascii.

I guess you could make it work, but would that actually be simpler than
what is proposed? If it is denser, how much denser would it have to be to
justify the complexity?

I don't think it would be more complex than a string encoding. At least,
I'm not imagining we want to be super clever here.

I could even imagine doing a versioned giant bitfield and using the version
to handle auto-upgrade...

Just to be clear, I still want the nice format (much like your proposed
format, but maybe with the numbers outside of the "s) in the textual IR, I
just think we should use a more direct and efficient in-memory encoding
(and in-bitcode encoding if that isn't already suitably dense).

Where would the encoding schema be specified?

Same question applies to a string encoding. We have to define the schema
somewhere clearly. I'm just lobbying for the textual IR and the APIs to
both operate directly on N fields, and just make the memory representation
dense.

Note that there are simple things that can be done to make MDNodes more
efficient in common cases. The CallbackVH is only necessary when pointing
to Value*’s that are not MDNode/MDString, and
Constants-other-than-GlobalValue. If we make MDNode detect when it has
“all-immortal” operands (like most debug info nodes) then we could just
store Value*’s directly. This would be a completely invisible
implementation improvement, but would not provide the same level of
improvement as the “flatten into strings” approach. The two are quite
complementary.

Yea, I'd rather go for at least a bit more dense than that, but maybe we
should do this step-by-step.

I'm moderately opposed to just encoding these in a string format. I think
we can do something substantially better both for space, time, and
readability. Fundamentally, there is no reason for the original metadata
node you describe to not *encode* its operands into a dense bit-packed blob
of memory. We can still expose APIs that manipulate them as separate
entities, and have the AsmPrinter and AsmParser map back and forth with
nice human-readable forms. But even a simple varint encoding will be both
smaller and faster than ascii.

I guess you could make it work, but would that actually be simpler than
what is proposed? If it is denser, how much denser would it have to be to
justify the complexity?

I don't think it would be more complex than a string encoding. At least,
I'm not imagining we want to be super clever here.

I could even imagine doing a versioned giant bitfield and using the
version to handle auto-upgrade...

Just to be clear, I still want the nice format (much like your proposed
format, but maybe with the numbers outside of the "s) in the textual IR, I
just think we should use a more direct and efficient in-memory encoding
(and in-bitcode encoding if that isn't already suitably dense).

Where would the encoding schema be specified?

Same question applies to a string encoding. We have to define the schema
somewhere clearly. I'm just lobbying for the textual IR and the APIs to
both operate directly on N fields, and just make the memory representation
dense.

The difference here is that debug info parsing code would know the schema
externally - so the metadata itself wouldn't have to be self-describing or
typed in any way. Just a flat series of bytes of a fixed size would be
sufficient. (then leaving out the fields that refer to other IR constructs
such as functions, variables, etc)

But if we could make general metadata generally more compact that'd be nice
too and maybe sufficient/instead/not worth the added complexity in debug
info code of pulling out fields in the debug info handling code.

If it makes it possible for humans to read, author, and adjust debug info
test cases, that would also be worth it IMO. I'm really unsatisfied by the
reliance on C-code-in-comments to figure out what on earth the debug info
came from.

Fair point - it's a worthy goal I had only half considered in this context
but may be worthwhile doing so.

But it may be at odds - increasing/changing the schema to the point of
human readability may not be the most terse/memory-efficient representation
for the actual compiler (I'm thinking of things like actually having
self-describing textual representation with the names of fields, etc - but
maybe it doesn't have to go that far).

Perhaps there's some middle ground, but it might be difficult to justify
avoiding compile time improvements for the sake of editable/hand-authorable
debug info test cases... we'll have to toss it around a bit more to see
what shakes out in this space, I suspect.

You must mean something other than I’m imagining :-). From your description, I think you’re describing that we have some new kind of “compressed mdnode” that bitpacks data in some way, and that we’d have the existing .ll and .bc syntax be magically compacted without a syntax change. Is that what you’re describing?

If so, that sounds really complicated: all of the readers would have to recognize the new format, and we lose alignment of in memory IR with the printed form (making debugging the compiler simpler).

If you’re talking about exposing this as syntax in .ll files (and encoding details in .bc files) then I’m not sure how this would work. You’d have to have the schema for each node described somewhere. Where would this exist.

More generally, can you explain more of what you’re thinking here?

The advantage of strings is that it moves the schema complexity to the debug info - related machinery like DIBuilder. The core IR features like MDNode, the asmparser/writer, bitcode support, etc don’t need anything specific to debug information.

-Chris

Where would the encoding schema be specified?

Same question applies to a string encoding. We have to define the schema
somewhere clearly. I'm just lobbying for the textual IR and the APIs to
both operate directly on N fields, and just make the memory representation
dense.

The difference here is that debug info parsing code would know the schema
externally - so the metadata itself wouldn't have to be self-describing or
typed in any way. Just a flat series of bytes of a fixed size would be
sufficient. (then leaving out the fields that refer to other IR constructs
such as functions, variables, etc)

But if we could make general metadata generally more compact that'd be
nice too and maybe sufficient/instead/not worth the added complexity in
debug info code of pulling out fields in the debug info handling code.

If it makes it possible for humans to read, author, and adjust debug info
test cases, that would also be worth it IMO. I'm really unsatisfied by the
reliance on C-code-in-comments to figure out what on earth the debug info
came from.

+1 This has been something that has been flustering me every time I see it.

-- Sean Silva