Proposal: type uniquing of debug info for LTO

!llvm.hardref.foo.h.myClass = ...;
!llvm.hardref.bar.h.myClass = ...;

I like this idea! It's much easier to read than metadata fields, but will it
need to have mangled names that today, are quoted?

Mangled in some form - using the actual language mangling's probably a
good idea, since that's the uniqueness we require anyway (this would
help/be necessary for local types in inline functions, etc - chances
are we would've had to do that even if we were using the previously
discussed string/mdnode pair scheme)

Do we still have the notion that named metadata cannot be deleted?

That's the invariant, yes. Named metadata are the roots of the
glorious metadata tree from which the debug info fruit grows. Or
something.

More details please :]

What do you mean by “name the metadata”? Are you referring to the name field

of the MDNode?

Using named metadata rather than unnamed metadata.

rather than having:

!llvm.hardref = !{metadata !“foo.h::myClass”, !3, metadata
!“bar.h:myEnum”, !4} …
!3 = …;
!4 = …;

we could simply have:

!llvm.hardref.foo.h.myClass = …;
!llvm.hardref.bar.h.myClass = …;

or something like that.

From the documentation:
NamedMDNode - a tuple of MDNodes. Despite its name, a NamedMDNode isn’t itself an MDNode. NamedMDNodes belong to modules, have names, and contain lists of MDNodes.

Our DI classes all wrap around a MDNode. Also during IR linking, NamedMDNode is handled differently.
Did I miss something?

Thanks,
Manman

using the actual language mangling's probably a good idea

Yup, sounds good.

That's the invariant, yes. Named metadata are the roots of the

glorious metadata tree from which the debug info fruit grows. Or
something.

So, that fits well today, since (from another thread) we generate debug
info when really needed, so types should not need to be trimmed.

If we ever generate types that later become unnecessary, we could bloat the
dwarf table, no?

cheers,
--renato

The question isn't necessarily valid - just because code is optimized
away doesn't mean the type is unnecessary. Users debugging code still
may want to know what the type was & how to interact with it, etc.

Yes, there are some corner cases (we've discussed emitting types only
into the TU with the key function, if there is one - but we'd still
have to emit the full definition into other TUs that say, have a
variable of that type - sure, if we optimized out that variable we
could maybe drop the type definition to a declaration)

More details please :]

What do you mean by "name the metadata"? Are you referring to the name field

of the MDNode?

Using named metadata rather than unnamed metadata.

rather than having:

!llvm.hardref = !{metadata !"foo.h::myClass", !3, metadata
!"bar.h:myEnum", !4} ...
!3 = ...;
!4 = ...;

we could simply have:

!llvm.hardref.foo.h.myClass = ...;
!llvm.hardref.bar.h.myClass = ...;

or something like that.

From the documentation:
NamedMDNode - a tuple of MDNodes. Despite its name, a NamedMDNode isn't
itself an MDNode. NamedMDNodes belong to modules, have names, and contain
lists of MDNodes.

Our DI classes all wrap around a MDNode. Also during IR linking, NamedMDNode
is handled differently.
Did I miss something?

Nope - just that I'm not especially familiar with this stuff, as I've mentioned.

So we could change from:

!llvm.hardref = !{metadata !"foo.h::myClass", !3, metadata
!"bar.h:myEnum", !4} ...
!3 = ...;
!4 = ...;

to

!llvm.hardref.foo.h.myClass = !3;
!llvm.hardref.bar.h.myClass = !4;

But that might be less beneficial, since there's still that indirection, etc.

(in any case this is orthogonal to the major issue under discussion -
it's equally applicable to either (a) or (b))

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

---------------------
Option b) I am going to expand David's description with more details, correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

-eric

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

---------------------
Option b) I am going to expand David's description with more details, correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let's start a new thread with title "Proposal: extend Metadata (MDNode) to support MDHash" or something like that.
We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Thanks,
Manman

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

---------------------
Option b) I am going to expand David's description with more details, correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let's start a new thread with title "Proposal: extend Metadata (MDNode) to support MDHash" or something like that.

I'd like to hear a rationale for wanting b first. I don't know (and
can think of) any other users of this functionality right now.

We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

-eric

From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu]
> ---------------------
> Option b) I am going to expand David's description with more details,
correct me if I am wrong.

I'll let David continue this part of the thread with you.

>
> I personally prefer option b over a since it is much cleaner.
>

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I can't say I've followed all the details but would this be useful
in moving toward support of .debug_types at all? There's a
superficial resemblance in conjuring up hashes of types referenced
from elsewhere, and I can't help wondering if the resemblance is
any deeper than that. (I'm not insisting that it has to be, at
worst it shouldn't get in the way if we can anticipate what would
be required for .debug_types to work.)

Thanks
--paulr

Nope. Completely orthogonal.

I've almost got dwarf4 type hashing ready for commit anyhow.

-eric

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type information, since we don't have a handle to the map.

---------------------
Option b) I am going to expand David's description with more details, correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let's start a new thread with title "Proposal: extend Metadata (MDNode) to support MDHash" or something like that.

I'd like to hear a rationale for wanting b first. I don't know (and
can think of) any other users of this functionality right now.

I will try my best :]
MDNode has a list of operands, one MDNode's full content can be thought of as a graph, when there is a reference to another MDNode, we have
an edge. The MDHash can be used to give a unique id (or string) to the full content of the graph.

The hash value applies to any kind of MDNode with complicated MDNode references.

If you think this is not general enough to be in MDNode, my earlier proposal of extending MDNode to DINode is going the other direction.
Basically, MDNode will have an opcode like the opcode for Instruction, to support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class according to the opcode.
That requires a lot of changes too :frowning:

We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a definition, they both name the same class, but they are different.

Thanks,
Manman

What is wrong with having the map in LLVMContext?

General principle of keeping things in the least scope necessary.

MDNode has easy access to the context.

Ease of access isn't really the base requirement - otherwise we could
just put everything in LLVMContext. Reducing scope helps code
maintenance by making the influence of certain data is only accessed
within a certain scope.

And we need the map in LLVMContext for option b as well.

Yes, but option (b) is a generic LLVM-wide feature necessarily
impacting such data structures. Option (a) is not, and should not have
such broad affects.

David mentioned it is too central. But we already have a few DI related data structures in LLVMContext.

Several of those data structures (so far as I know) are needed outside
the domain of debug info emission (the couldn't exist just inside
DwarfDebug, for example).

- David

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type information, since we don't have a handle to the map.

The only case where dump currently navigates a type link that I know
is printing derived types, where we print "[from foo]". By just
printing the string identifier for the type (without having to
navigate the link/use the map) should be about as useful here.

---------------------
Option b) I am going to expand David's description with more details, correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let's start a new thread with title "Proposal: extend Metadata (MDNode) to support MDHash" or something like that.

I'd like to hear a rationale for wanting b first. I don't know (and
can think of) any other users of this functionality right now.

I will try my best :]
MDNode has a list of operands, one MDNode's full content can be thought of as a graph, when there is a reference to another MDNode, we have
an edge. The MDHash can be used to give a unique id (or string) to the full content of the graph.

The hash value applies to any kind of MDNode with complicated MDNode references.

If you think this is not general enough to be in MDNode,

Eric's concern/question is whether it's a feature that would be useful
to any other code, I believe. (I'm less concerned about this so long
as the feature is appropriately general, even if it doesn't have any
other consumers at the moment - I don't think it'd drastically
complicate the Metadata handling code (which I think is currently
fairly simple, but I admit to not knowing it in any great detail))

my earlier proposal of extending MDNode to DINode is going the other direction.

As discussed, this isn't far enough in the other direction. If we're
going to touch core LLVM Metadata handling code at all, it needs to be
a complete/cohesive feature, not just tendrils of a debug info
feature. Hence (a) or (b) not (c).

Basically, MDNode will have an opcode like the opcode for Instruction, to support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class according to the opcode.
That requires a lot of changes too :frowning:

We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a definition, they both name the same class, but they are different.

This would probably tip me over into Eric's perspective that it would
be too esoteric a feature to add to a generic/core piece of
infrastructure.

Yes, I'm not sure we should resolve duplicates in this case, then
(short of assigning a separate identifier to declarations V
definitions so they don't collide/get deduped - then doing a pass in
DwarfDebug/during debug-info-generation time to resolve in favor of
definitions (hmm, yeah, even that doesn't quite make sense - you'd
have to be able to visit the things referring to the declarations &
update them to refer to definitions, and we'd have no real way to
discover those easily except walking all the debug info... could
work))

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type information, since we don't have a handle to the map.

It can give you the identifier value for it, which, while not pretty
is likely just as good.

If you think this is not general enough to be in MDNode, my earlier proposal of extending MDNode to DINode is going the other direction.
Basically, MDNode will have an opcode like the opcode for Instruction, to support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class according to the opcode.
That requires a lot of changes too :frowning:

I'm still not seeing why your option is the only other one. I.e. my
first proposal.

We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a definition, they both name the same class, but they are different.

The contents of the node have forward declaration versus definition.
They won't be merged or RAUW'd. As you build up the hash table you can
make sure that definitions win in creating DIEs yes? i.e. just add the
full types to the map as you construct them and when you look up
DW_AT_type/DW_AT_specification you can replace with the correct DIE?

-eric

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.

Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I’ve not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type information, since we don’t have a handle to the map.

It can give you the identifier value for it, which, while not pretty
is likely just as good.

Yes, but you can dump the content of the derived type in gdb | lldb if you have the MDNode pointer :]

If you think this is not general enough to be in MDNode, my earlier proposal of extending MDNode to DINode is going the other direction.
Basically, MDNode will have an opcode like the opcode for Instruction, to support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class according to the opcode.
That requires a lot of changes too :frowning:

I’m still not seeing why your option is the only other one. I.e. my
first proposal.

I didn’t say that my option is the only other one.
Do you mean option a) by “first proposal”?

We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a definition, they both name the same class, but they are different.

The contents of the node have forward declaration versus definition.
They won’t be merged or RAUW’d. As you build up the hash table you can
make sure that definitions win in creating DIEs yes?

By “the hash table”, are you referring to the map added to LLVMContext?

If you are, MDNode does not know whether it is a declaration or a definition (that is DI specific), so we can’t make
sure definitions win in the map. But if we allow mapping one hash value to a few MDNodes,
during DIE creation, we can make sure the definition wins by picking the definition. That may work.

Thanks,
Manman

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.

Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I’ve not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type information, since we don’t have a handle to the map.

The only case where dump currently navigates a type link that I know
is printing derived types, where we print “[from foo]”. By just
printing the string identifier for the type (without having to
navigate the link/use the map) should be about as useful here.


Option b) I am going to expand David’s description with more details, correct me if I am wrong.

I’ll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn’t involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I’m fine with b as a choice, but it’s going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let’s start a new thread with title “Proposal: extend Metadata (MDNode) to support MDHash” or something like that.

I’d like to hear a rationale for wanting b first. I don’t know (and
can think of) any other users of this functionality right now.

I will try my best :]
MDNode has a list of operands, one MDNode’s full content can be thought of as a graph, when there is a reference to another MDNode, we have
an edge. The MDHash can be used to give a unique id (or string) to the full content of the graph.

The hash value applies to any kind of MDNode with complicated MDNode references.

If you think this is not general enough to be in MDNode,

Eric’s concern/question is whether it’s a feature that would be useful
to any other code, I believe. (I’m less concerned about this so long
as the feature is appropriately general, even if it doesn’t have any
other consumers at the moment - I don’t think it’d drastically
complicate the Metadata handling code (which I think is currently
fairly simple, but I admit to not knowing it in any great detail))

Yes, I tried to explain why MDHash can help with MDNodes that form a complicated graph.

my earlier proposal of extending MDNode to DINode is going the other direction.

As discussed, this isn’t far enough in the other direction. If we’re
going to touch core LLVM Metadata handling code at all, it needs to be
a complete/cohesive feature, not just tendrils of a debug info
feature. Hence (a) or (b) not (c).

Basically, MDNode will have an opcode like the opcode for Instruction, to support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class according to the opcode.
That requires a lot of changes too :frowning:

We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a definition, they both name the same class, but they are different.

This would probably tip me over into Eric’s perspective that it would
be too esoteric a feature to add to a generic/core piece of
infrastructure.

If we allow the map to map one hash value (or string) to multiple MDNodes, without adding the weight, it should still work.
But is that general enough?

Thanks,
Manman

Then there would be no transparent access through the nodes - so I
don't know what the access API would look like anymore & it would seem
rather awkward as a general API.

The more you point out the reasonable complications in using such a
generic API for your task, the less it seems like it would be
sufficient/a good fit - the more I'd be inclined to go with Eric's
choice of just keeping the mapping locally (in DwarfDebug, etc) & not
impacting metadata APIs at all.

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to
perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when
a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access
functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type
information, since we don't have a handle to the map.

The only case where dump currently navigates a type link that I know
is printing derived types, where we print "[from foo]". By just
printing the string identifier for the type (without having to
navigate the link/use the map) should be about as useful here.

---------------------
Option b) I am going to expand David's description with more details,
correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let's start a new thread with title "Proposal: extend Metadata (MDNode) to
support MDHash" or something like that.

I'd like to hear a rationale for wanting b first. I don't know (and
can think of) any other users of this functionality right now.

I will try my best :]
MDNode has a list of operands, one MDNode's full content can be thought of
as a graph, when there is a reference to another MDNode, we have
an edge. The MDHash can be used to give a unique id (or string) to the full
content of the graph.

The hash value applies to any kind of MDNode with complicated MDNode
references.

If you think this is not general enough to be in MDNode,

Eric's concern/question is whether it's a feature that would be useful
to any other code, I believe. (I'm less concerned about this so long
as the feature is appropriately general, even if it doesn't have any
other consumers at the moment - I don't think it'd drastically
complicate the Metadata handling code (which I think is currently
fairly simple, but I admit to not knowing it in any great detail))

Yes, I tried to explain why MDHash can help with MDNodes that form a
complicated graph.

my earlier proposal of extending MDNode to DINode is going the other
direction.

As discussed, this isn't far enough in the other direction. If we're
going to touch core LLVM Metadata handling code at all, it needs to be
a complete/cohesive feature, not just tendrils of a debug info
feature. Hence (a) or (b) not (c).

Basically, MDNode will have an opcode like the opcode for Instruction, to
support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class
according to the opcode.
That requires a lot of changes too :frowning:

We can add a weight to MDHash (other names are fine with me), to support
multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a
definition, they both name the same class, but they are different.

This would probably tip me over into Eric's perspective that it would
be too esoteric a feature to add to a generic/core piece of
infrastructure.

If we allow the map to map one hash value (or string) to multiple MDNodes,
without adding the weight, it should still work.
But is that general enough?

Then there would be no transparent access through the nodes - so I
don't know what the access API would look like anymore & it would seem
rather awkward as a general API.

The access function will have two modes:
1> it does not matter which one is returned
(This works for verification purposes and printing purposes.)
2> return all the MDNodes having the hash, DwarfDebug will figure out which one has priority

It is not clean. I actually prefer adding a weight, it kind of makes sense when we are dealing with multiple MDNodes with the same hash.
With weight, we can always pick one with higher weight (or priority).

If we veto option b, we will only have option a, which has its own issues:
1> it is not really local to debug info, we have to modify AsmWriter at least.
2> we have to keep (and construct) one map at a few places.
3> we have to modify the following access function in DI classes
getContext, getTypeDerivedFrom, getClassType, getContainingType, getType
to take an extra map
and of course the call sites of all these functions.

Compare against putting the map in LLVMContext:
1> quoting from David
"General principle of keeping things in the least scope necessary.
Reducing scope helps code
maintenance by making the influence of certain data is only accessed
within a certain scope."

Another issue is on how to submit the patches:
I prefer to submit small patches and have a transition period where we use -gtype-uniquing to test the work in progress.
The advantages are
1> People can see our progress.
2> We can incrementally add more testing cases while testing the feature under the flag.
3> If the patch causes problem, we don't have to revert the big patch and submit another big patch that has the fix.

Thanks,
Manman

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to
perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when
a MDNode is replaced.
---------------------
Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access
functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I've not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type
information, since we don't have a handle to the map.

The only case where dump currently navigates a type link that I know
is printing derived types, where we print "[from foo]". By just
printing the string identifier for the type (without having to
navigate the link/use the map) should be about as useful here.

---------------------
Option b) I am going to expand David's description with more details,
correct me if I am wrong.

I'll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn't involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I'm fine with b as a choice, but it's going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let's start a new thread with title "Proposal: extend Metadata (MDNode) to
support MDHash" or something like that.

I'd like to hear a rationale for wanting b first. I don't know (and
can think of) any other users of this functionality right now.

I will try my best :]
MDNode has a list of operands, one MDNode's full content can be thought of
as a graph, when there is a reference to another MDNode, we have
an edge. The MDHash can be used to give a unique id (or string) to the full
content of the graph.

The hash value applies to any kind of MDNode with complicated MDNode
references.

If you think this is not general enough to be in MDNode,

Eric's concern/question is whether it's a feature that would be useful
to any other code, I believe. (I'm less concerned about this so long
as the feature is appropriately general, even if it doesn't have any
other consumers at the moment - I don't think it'd drastically
complicate the Metadata handling code (which I think is currently
fairly simple, but I admit to not knowing it in any great detail))

Yes, I tried to explain why MDHash can help with MDNodes that form a
complicated graph.

my earlier proposal of extending MDNode to DINode is going the other
direction.

As discussed, this isn't far enough in the other direction. If we're
going to touch core LLVM Metadata handling code at all, it needs to be
a complete/cohesive feature, not just tendrils of a debug info
feature. Hence (a) or (b) not (c).

Basically, MDNode will have an opcode like the opcode for Instruction, to
support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class
according to the opcode.
That requires a lot of changes too :frowning:

We can add a weight to MDHash (other names are fine with me), to support
multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a
definition, they both name the same class, but they are different.

This would probably tip me over into Eric's perspective that it would
be too esoteric a feature to add to a generic/core piece of
infrastructure.

If we allow the map to map one hash value (or string) to multiple MDNodes,
without adding the weight, it should still work.
But is that general enough?

Then there would be no transparent access through the nodes - so I
don't know what the access API would look like anymore & it would seem
rather awkward as a general API.

The access function will have two modes:
1> it does not matter which one is returned
(This works for verification purposes and printing purposes.)
2> return all the MDNodes having the hash, DwarfDebug will figure out which one has priority

It is not clean. I actually prefer adding a weight, it kind of makes sense when we are dealing with multiple MDNodes with the same hash.
With weight, we can always pick one with higher weight (or priority).

If we veto option b, we will only have option a, which has its own issues:
1> it is not really local to debug info, we have to modify AsmWriter at least.

I'm pretty sure I've kept saying we don't need to do this. Printing
asm can just print the identifier for the type node - it doesn't need
to navigate across that link.

2> we have to keep (and construct) one map at a few places.
3> we have to modify the following access function in DI classes
getContext, getTypeDerivedFrom, getClassType, getContainingType, getType
to take an extra map
and of course the call sites of all these functions.

Compare against putting the map in LLVMContext:
1> quoting from David
"General principle of keeping things in the least scope necessary.
Reducing scope helps code
maintenance by making the influence of certain data is only accessed
within a certain scope."

Another issue is on how to submit the patches:
I prefer to submit small patches and have a transition period where we use -gtype-uniquing to test the work in progress.

Incremental work doesn't require a flag. You can simply migrate one
field at a time. Is there a reason that wouldn't work?

A summary of options for issue #3:
3> To actually access the MDNode referenced via the hash value, we need to
perform a lookup from the hash value to find the corresponding MDNode.
The questions are where to store this map and how to keep it up-to-date when
a MDNode is replaced.

Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access
functions in DI classes to pass in the map.
May need to modify printing functions in AsmWriter.

You may not need the map in DIBuilder if you piggy back on the
existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
for just debug dumps of the IR. A way around this might be nice, but
I’ve not looked into it.

Calling dump() on a MDNode inside gdb will not give us the derived type
information, since we don’t have a handle to the map.

The only case where dump currently navigates a type link that I know
is printing derived types, where we print “[from foo]”. By just
printing the string identifier for the type (without having to
navigate the link/use the map) should be about as useful here.


Option b) I am going to expand David’s description with more details,
correct me if I am wrong.

I’ll let David continue this part of the thread with you.

I personally prefer option b over a since it is much cleaner.

Option b is more general and should probably work. I suggested a
originally because I know it would work and the impact is limited to
debug info. Also it doesn’t involve touching the generic metadata
interface or code and could be done quickly and incrementally. b may
also require further changes to the debug info infrastructure to deal
with lookup, etc so may end up being as much churn as a. b is more
useful if we think other users of metadata may want/need similar
functionality in the future.

I’m fine with b as a choice, but it’s going to involve more planning
and code review and involve people outside of you, me and Dave :slight_smile:

Got it. David, are you okay with the details I added about option b?
Let’s start a new thread with title “Proposal: extend Metadata (MDNode) to
support MDHash” or something like that.

I’d like to hear a rationale for wanting b first. I don’t know (and
can think of) any other users of this functionality right now.

I will try my best :]
MDNode has a list of operands, one MDNode’s full content can be thought of
as a graph, when there is a reference to another MDNode, we have
an edge. The MDHash can be used to give a unique id (or string) to the full
content of the graph.

The hash value applies to any kind of MDNode with complicated MDNode
references.

If you think this is not general enough to be in MDNode,

Eric’s concern/question is whether it’s a feature that would be useful
to any other code, I believe. (I’m less concerned about this so long
as the feature is appropriately general, even if it doesn’t have any
other consumers at the moment - I don’t think it’d drastically
complicate the Metadata handling code (which I think is currently
fairly simple, but I admit to not knowing it in any great detail))

Yes, I tried to explain why MDHash can help with MDNodes that form a
complicated graph.

my earlier proposal of extending MDNode to DINode is going the other
direction.

As discussed, this isn’t far enough in the other direction. If we’re
going to touch core LLVM Metadata handling code at all, it needs to be
a complete/cohesive feature, not just tendrils of a debug info
feature. Hence (a) or (b) not (c).

Basically, MDNode will have an opcode like the opcode for Instruction, to
support TBAA, DI, and other types.
Inside streamers (bc ll reader), we can create the specialized class
according to the opcode.
That requires a lot of changes too :frowning:

We can add a weight to MDHash (other names are fine with me), to support
multiple MDNodes with the same hash, but different weight.

Wait, what? This sounds like a bad idea.

Why?
The weight is needed to differentiate a forward declaration vs. a
definition, they both name the same class, but they are different.

This would probably tip me over into Eric’s perspective that it would
be too esoteric a feature to add to a generic/core piece of
infrastructure.

If we allow the map to map one hash value (or string) to multiple MDNodes,
without adding the weight, it should still work.
But is that general enough?

Then there would be no transparent access through the nodes - so I
don’t know what the access API would look like anymore & it would seem
rather awkward as a general API.

The access function will have two modes:
1> it does not matter which one is returned
(This works for verification purposes and printing purposes.)
2> return all the MDNodes having the hash, DwarfDebug will figure out which one has priority

It is not clean. I actually prefer adding a weight, it kind of makes sense when we are dealing with multiple MDNodes with the same hash.
With weight, we can always pick one with higher weight (or priority).

If we veto option b, we will only have option a, which has its own issues:
1> it is not really local to debug info, we have to modify AsmWriter at least.

I’m pretty sure I’ve kept saying we don’t need to do this. Printing
asm can just print the identifier for the type node - it doesn’t need
to navigate across that link.

Okay, if we are fine with no access to the link in a gdb | lldb session.

2> we have to keep (and construct) one map at a few places.
3> we have to modify the following access function in DI classes
getContext, getTypeDerivedFrom, getClassType, getContainingType, getType
to take an extra map
and of course the call sites of all these functions.

Compare against putting the map in LLVMContext:
1> quoting from David
“General principle of keeping things in the least scope necessary.
Reducing scope helps code
maintenance by making the influence of certain data is only accessed
within a certain scope.”

Another issue is on how to submit the patches:
I prefer to submit small patches and have a transition period where we use -gtype-uniquing to test the work in progress.

Incremental work doesn’t require a flag. You can simply migrate one
field at a time. Is there a reason that wouldn’t work?

The reason I want a flag is to avoid the need to update the existing testing cases while this is a work in progress.
I believe migrating one field means updating the existing testing cases?

Thanks,
Manman