[RFC] Separating Metadata from the Value hierarchy

TL;DR: If you use metadata (especially if it's out-of-tree), check the
numbered list of lost functionality below to see whether I'm trying to
break your compiler permanently.

In response to my recent commits (e.g., [1]) that changed API from
`MDNode` to `Value`, Eric had a really interesting idea [2] -- split
metadata entirely from the `Value` hierarchy, and drop general support
for metadata RAUW.

After hacking together some example code, this seems overwhelmingly to
me like the right direction. See the attached metadata-v2.patch for my
sketch of what the current metadata primitives might look like in the
new hierarchy (includes LLVMContextImpl uniquing support).

The initial motivation was to avoid the API weaknesses caused by having
non-`MDNode` metadata that inherits from `Value`. In particular,
instead of changing API from `MDNode` to `Value`, change it to a base
class called `Metadata`, which sheds the underused and expensive `Value`
base class entirely.

The implications are broader: an enormous reduction in complexity (and
overhead) for metadata.

This change implies minor (major?) loss of functionality for metadata,
but Eric and I suspect that the only hard-to-fix user is debug info
itself, whose IR infrastructure I'm rewriting anyway.

Here is what we'd lose:

1. No more arbitrary RAUW of metadata.

    While we'd keep support for RAUW of temporary MDNodes for use as
    forward declarations (when parsing assembly or constructing cyclic
    graphs), drop RAUW support for all other metadata.

    Note that we'd also keep support for RAUW of `Value` operands of
    metadata.

    If the RAUW of an operand causes a uniquing collision, uniquing
    would be dropped for that node. This matches the current behaviour
    when an operand goes to null.

    Upgrade path: none.

2. No more function-local metadata.

    AFAICT, function-local metadata is *only* used for indirect
    references to instructions and arguments in `@llvm.dbg.value` and
    `@llvm.dbg.declare` intrinsics. The first argument of the following
    is an example:

        call void @llvm.dbg.value(metadata !{i32 %val}, metadata !0,
                                  metadata !1)

    Note that the debug info people uniformly seem to dislike the status
    quo, since it's awkward to get from a `Value` to the corresponding
    intrinsic.

    Upgrade path: Instead of using an intrinsic that references a
    function-local value through an `MDNode`, attach metadata to the
    corresponding argument or instruction, or to the terminating
    instruction of the basic block. (This requires new support for
    attaching metadata to function arguments, which I'll have to add for
    debug info anyway.)

Is this going to break your compiler? How? Why is your use case worth
supporting?

[1]: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141103/242667.html
    "r221167 - IR: MDNode => Value: Instruction::getAllMetadataOtherThanDebugLoc()"
[2]: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-November/078581.html
    "Re: First-class debug info IR: MDLocation"

metadata-v2.patch (30.3 KB)

FWIW, this completely addresses my only ill feelings about the changes you were making. I really like it.

I might bikeshed some of the names, but whatever. I would suggest maybe augmenting some of the doxygen comments to help show the specific use case that motivates the node? You already have this in a few places and it really helped me skimming the code. In particular, there are a bunch of very similar nodes around the tracking, temp, uniquable, etc. and I think it’ll be important to clearly distinguish each use case that these are designed to address.

Also just want to say thanks for diving into the design and finding something (at least, I’m hoping!) works even better.

FWIW, this completely addresses my only ill feelings about the changes you were making. I really like it.

I really like it, too.

Eric, thanks for the great suggestion!

From: "Duncan P. N. Exon Smith" <dexonsmith@apple.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Sunday, November 9, 2014 7:02:43 PM
Subject: [LLVMdev] [RFC] Separating Metadata from the Value hierarchy

TL;DR: If you use metadata (especially if it's out-of-tree), check
the
numbered list of lost functionality below to see whether I'm trying
to
break your compiler permanently.

In response to my recent commits (e.g., [1]) that changed API from
`MDNode` to `Value`, Eric had a really interesting idea [2] -- split
metadata entirely from the `Value` hierarchy, and drop general
support
for metadata RAUW.

After hacking together some example code, this seems overwhelmingly
to
me like the right direction. See the attached metadata-v2.patch for
my
sketch of what the current metadata primitives might look like in the
new hierarchy (includes LLVMContextImpl uniquing support).

The initial motivation was to avoid the API weaknesses caused by
having
non-`MDNode` metadata that inherits from `Value`. In particular,
instead of changing API from `MDNode` to `Value`, change it to a base
class called `Metadata`, which sheds the underused and expensive
`Value`
base class entirely.

The implications are broader: an enormous reduction in complexity
(and
overhead) for metadata.

This change implies minor (major?) loss of functionality for
metadata,
but Eric and I suspect that the only hard-to-fix user is debug info
itself, whose IR infrastructure I'm rewriting anyway.

Here is what we'd lose:

1. No more arbitrary RAUW of metadata.

While we'd keep support for RAUW of temporary MDNodes for use as
forward declarations (when parsing assembly or constructing cyclic
graphs), drop RAUW support for all other metadata.

So temporary MDNodes would be Values, but regular metadata would not be? Will regular metadata nodes no longer have lists of users? If I have a TrackingVH<MDNode> with temporary MDNodes, after I call replaceAllUsesWith, what happens?

I'm specifically wondering how we'll need to update CloneAliasScopeMetadata in lib/Transforms/Utils/InlineFunction.cpp, and I don't see any fundamental problem with what you've proposed, but these seem like generic upgrade questions.

Thanks again,
Hal

In response to my recent commits (e.g., [1]) that changed API from
`MDNode` to `Value`, Eric had a really interesting idea [2] -- split
metadata entirely from the `Value` hierarchy, and drop general support
for metadata RAUW.

Wow, this never occurred to me, but in hindsight seems like obviously the right direction.

Here is what we'd lose:

1. No more arbitrary RAUW of metadata.

   While we'd keep support for RAUW of temporary MDNodes for use as
   forward declarations (when parsing assembly or constructing cyclic
   graphs), drop RAUW support for all other metadata.

This seems fine to me, a corollary of this should be that MDNodes never “move around” in memory due to late uniquing etc, which means that TrackingVH shouldn’t be necessary, right? This should make all frontends a lot more memory efficient because they can just use raw pointers to MDNodes everywhere.

2. No more function-local metadata.

This was a great idea that never mattered, I’m happy to drop it for the greater good :slight_smile:

Some comments on the patch-in-progress:

+++ b/include/llvm/IR/MetadataV2.h
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/PointerUnion.h"
+#include "llvm/ADT/StringMap.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Support/ErrorHandling.h"

Please factor things to avoid including heavy headers like DenseMap and StringMap (e.g. by moving the Impl stuff out).

+class Metadata {
+private:
+ LLVMContext &Context;

Out of curiosity, why do Metadata nodes all need to carry around a context? This seems like memory bloat that would be great to avoid if possible.

+template <class T> class TypedMDRef : public MDRef {

Is this a safe way to model this, should it use private inheritance instead? Covariance seems like it would break this: specifically something could pass off a typed MDRef as an MDRef&, then the client could reassign something of the wrong type into it.

+/// \note This is really cheap and easy to support, and it's useful for
+/// implementing:

When this makes more progress, the comments should explain what it does and how it works, out of context of the patch.

+ MDString(LLVMContext *Context) : Metadata(*Context, MDStringKind) {
+ assert(Context && "Expected non-null context");
...
+ static MDStringRef get(LLVMContext &Context, StringRef String);

You have reference/pointer disagreement, I’d recommend taking LLVMContext& to the ctor for uniformity (and remove the assert).

+class ReplaceableMetadataImpl {

I’m not following your algorithm intentions well yet, but this seems like a really heavy-weight implementation, given that this is a common base class for a number of other important things.

+class ValueAsMetadata : public Metadata, ReplaceableMetadataImpl {

I think that ValueAsMetadata is a great thing to have - it is essential to refer to global variables etc. That said, I think that it will eventually be worthwhile to have metadata versions of integers and other common things to reduce memory. This should come in a later pass of course.

I don’t follow the point of
+class UniquableMDNode : public MDNode {

What would subclass it? What is the use-case? Won’t MDStrings continue to be uniqued?

Overall, I’m thrilled to see this direction, thanks for pushing forward on it Duncan!

-Chris

So, none of them would be `Value`s.

`TempMDNode` supports its own, non-`Value`-based, RAUW via
`ReplaceableMetadataImpl` (`ValueAsMetadata` uses the same).

`CloneAliasScopeMetadata()` should use `TrackingMDRef` in place of
`TrackingVH<MDNode>`. `TrackingMDRef` will register itself with the
`Metadata` if it's replaceable, and if/when it gets RAUW'ed, the pointer
will get updated.

If you have another look at the patch it might be more clear now.

BTW, another thing I added in that sketch was the option to explicitly
request a non-uniqued `MDNode`. Haven't thought through details, but I
was specifically thinking this would be a cleaner way to create your
non-uniquable alias nodes (instead of the current self-referential
approach).

My working straw man for assembly syntax looks like this:

    !1 = metadata distinct !{ ... }

As an example, the alias "root" nodes could change from this:

    !1 = metadata !{ metadata !1 }

to this:

    !1 = metadata distinct !{}

Which means constructing them would change from this:

    MDNode *T = MDNode::getTemporary(Context, None)
    MDNode *N = MDNode::get(Context, {T});
    T->replaceAllUsesWith(N);
    MDNode::deleteTemporary(T);

to this:

    // In that patch it's "getNonUniqued()".
    MDNode *N = MDNode::getDistinct(Context, None);

Furthermore, if all references to the alias root got dropped, they'd
clean themselves up instead of keeping each other alive in a cycle.

In response to my recent commits (e.g., [1]) that changed API from
`MDNode` to `Value`, Eric had a really interesting idea [2] -- split
metadata entirely from the `Value` hierarchy, and drop general support
for metadata RAUW.

Wow, this never occurred to me, but in hindsight seems like obviously the right direction.

Yup, good on Eric here.

Here is what we'd lose:

1. No more arbitrary RAUW of metadata.

  While we'd keep support for RAUW of temporary MDNodes for use as
  forward declarations (when parsing assembly or constructing cyclic
  graphs), drop RAUW support for all other metadata.

This seems fine to me, a corollary of this should be that MDNodes never “move around” in memory due to late uniquing etc, which means that TrackingVH shouldn’t be necessary, right? This should make all frontends a lot more memory efficient because they can just use raw pointers to MDNodes everywhere.

Almost.

Two caveats:

1. Handles should generally be `MDRef` instead of `Metadata*`, since I
    threw in reference-counting semantics so that no-longer-referenced
    metadata cleans itself up.

2. If a handle might point to a forward reference -- i.e., a
    `TempMDNode` in this patch -- it should use `TrackingMDRef`. When
    the forward reference gets resolved, it updates all of its tracking
    references.

Nevertheless, `sizeof(MDRef) == sizeof(TrackingMDRef) == sizeof(void*)`.
Frontends *only* pay memory overhead for the currently-unresolved
forward references.

(Maybe `MDNodeFwdRef` is a better name than `TempMDNode`?)

2. No more function-local metadata.

This was a great idea that never mattered, I’m happy to drop it for the greater good :slight_smile:

Awesome.

+class Metadata {
+private:
+ LLVMContext &Context;

Out of curiosity, why do Metadata nodes all need to carry around a context? This seems like memory bloat that would be great to avoid if possible.

There are two uses for the context:

  - RAUW. Metadata that can RAUW (`TempMDNode` and `MetadataAsValue`)
    need a context. Other metadata need access to a context when their
    operands get RAUW'ed (so they can re-unique themselves), but this
    could be passed in as an argument, so they don't need their own.
    
  - Reference counting. My sketch uses reference counting for all the
    nodes, so they all need a context to delete themselves when their
    last reference gets dropped.

Customized ownership of debug info metadata will allow us to get the
context from parent pointers, so we might be able to remove this down
the road (depending on whether the `Metadata` subclass is uniqued).

+template <class T> class TypedMDRef : public MDRef {

Is this a safe way to model this, should it use private inheritance instead? Covariance seems like it would break this: specifically something could pass off a typed MDRef as an MDRef&, then the client could reassign something of the wrong type into it.

Good point.

+ MDString(LLVMContext *Context) : Metadata(*Context, MDStringKind) {
+ assert(Context && "Expected non-null context");
...
+ static MDStringRef get(LLVMContext &Context, StringRef String);

You have reference/pointer disagreement, I’d recommend taking LLVMContext& to the ctor for uniformity (and remove the assert).

This is a workaround for `StringMapEntry`'s imperfect forwarding --
it passes the `InitVal` constructor argument by value.

I'll fix the problem at its source and clean this up.

+class ReplaceableMetadataImpl {

I’m not following your algorithm intentions well yet, but this seems like a really heavy-weight implementation, given that this is a common base class for a number of other important things.

Unfortunately I think this weight will be hard to optimize out (although
I'm sure it could be a little smaller).

In the `Value` hierarchy, you pay memory for RAUW at the site of each
`Use`. This makes sense, since most values can be RAUW'ed.

Since very little metadata needs to be RAUW'ed, we don't want to pay
memory at every use-site. Instead, we pay the cost inside the
(hopefully) few instances that support RAUW.

The core assumption is that there are relatively few replaceable
metadata instances. The only subclasses are `ValueAsMetadata` and
`TempMDNode`. How many of these are live?

  - There will be at most one `ValueAsMetadata` instance for each
    metadata operand that points at a `Value`. My data from a couple of
    months ago showed that there are very few of these.

  - `TempMDNodes` are used as forward references. You only pay their
    cost until the forward reference gets resolved (i.e., deleted).

The main case I'm worried about here are references to `ConstantInt`s,
which are used pretty heavily outside of debug info (e.g., in `!prof`
attachments). However, if that shows up in a profile, we can bypass
`Value`s entirely by creating a new `MDIntArray` subclass (or
something).

+class ValueAsMetadata : public Metadata, ReplaceableMetadataImpl {

I think that ValueAsMetadata is a great thing to have - it is essential to refer to global variables etc. That said, I think that it will eventually be worthwhile to have metadata versions of integers and other common things to reduce memory. This should come in a later pass of course.

Yup, agree entirely.

I don’t follow the point of
+class UniquableMDNode : public MDNode {

What would subclass it? What is the use-case? Won’t MDStrings continue to be uniqued?

I think you've just been misled by my terrible name.

This sketch splits the "temporary" concept for `MDNode` into a separate
subclass, so that non-forward-reference `MDNode`s don't have to pay for
RAUW overhead.

The class hierarchy I envision looks something like this:

    Metadata
      MDNode
        TempMDNode // MDNodeFwdRef?
        UniquableMDNode // GenericMDNode?
        DINode // Is this layer useful?
          DILocation
          DIScope
            DIType
              DIBasicType
              DICompositeType
              ...
            DISubprogram
            ...
          DICompileUnit
          ...
      MDString
      ValueAsMetadata

`UniquableMDNode` is a leaf-class that behaves like the current `MDNode`
(when it's not a temporary). I called it "uniquable" because, unlike
`TempMDNode`, it is uniqued by default (although you can opt-out of it,
and uniquing might be dropped).

Maybe a better name is `GenericMDNode`?

Also, off-topic, but after sketching out the imagined hierarchy above,
I'm not sure `DINode` is particularly useful.

From: "Duncan P. N. Exon Smith" <dexonsmith@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, November 10, 2014 11:08:24 AM
Subject: Re: [LLVMdev] [RFC] Separating Metadata from the Value hierarchy

>
>> From: "Duncan P. N. Exon Smith" <dexonsmith@apple.com>
>> To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
>> Sent: Sunday, November 9, 2014 7:02:43 PM
>> Subject: [LLVMdev] [RFC] Separating Metadata from the Value
>> hierarchy
>>
>>
>>
>> TL;DR: If you use metadata (especially if it's out-of-tree), check
>> the
>> numbered list of lost functionality below to see whether I'm
>> trying
>> to
>> break your compiler permanently.
>>
>> In response to my recent commits (e.g., [1]) that changed API from
>> `MDNode` to `Value`, Eric had a really interesting idea [2] --
>> split
>> metadata entirely from the `Value` hierarchy, and drop general
>> support
>> for metadata RAUW.
>>
>> After hacking together some example code, this seems
>> overwhelmingly
>> to
>> me like the right direction. See the attached metadata-v2.patch
>> for
>> my
>> sketch of what the current metadata primitives might look like in
>> the
>> new hierarchy (includes LLVMContextImpl uniquing support).
>>
>> The initial motivation was to avoid the API weaknesses caused by
>> having
>> non-`MDNode` metadata that inherits from `Value`. In particular,
>> instead of changing API from `MDNode` to `Value`, change it to a
>> base
>> class called `Metadata`, which sheds the underused and expensive
>> `Value`
>> base class entirely.
>>
>> The implications are broader: an enormous reduction in complexity
>> (and
>> overhead) for metadata.
>>
>> This change implies minor (major?) loss of functionality for
>> metadata,
>> but Eric and I suspect that the only hard-to-fix user is debug
>> info
>> itself, whose IR infrastructure I'm rewriting anyway.
>>
>> Here is what we'd lose:
>>
>> 1. No more arbitrary RAUW of metadata.
>>
>> While we'd keep support for RAUW of temporary MDNodes for use as
>> forward declarations (when parsing assembly or constructing cyclic
>> graphs), drop RAUW support for all other metadata.
>
> So temporary MDNodes would be Values, but regular metadata would
> not be? Will regular metadata nodes no longer have lists of users?
> If I have a TrackingVH<MDNode> with temporary MDNodes, after I
> call replaceAllUsesWith, what happens?
>
> I'm specifically wondering how we'll need to update
> CloneAliasScopeMetadata in
> lib/Transforms/Utils/InlineFunction.cpp, and I don't see any
> fundamental problem with what you've proposed, but these seem like
> generic upgrade questions.
>
> Thanks again,
> Hal

So, none of them would be `Value`s.

`TempMDNode` supports its own, non-`Value`-based, RAUW via
`ReplaceableMetadataImpl` (`ValueAsMetadata` uses the same).

`CloneAliasScopeMetadata()` should use `TrackingMDRef` in place of
`TrackingVH<MDNode>`. `TrackingMDRef` will register itself with the
`Metadata` if it's replaceable, and if/when it gets RAUW'ed, the
pointer
will get updated.

If you have another look at the patch it might be more clear now.

Yes, thanks!

BTW, another thing I added in that sketch was the option to
explicitly
request a non-uniqued `MDNode`. Haven't thought through details, but
I
was specifically thinking this would be a cleaner way to create your
non-uniquable alias nodes (instead of the current self-referential
approach).

My working straw man for assembly syntax looks like this:

    !1 = metadata distinct !{ ... }

As an example, the alias "root" nodes could change from this:

    !1 = metadata !{ metadata !1 }

to this:

    !1 = metadata distinct !{}

Which means constructing them would change from this:

    MDNode *T = MDNode::getTemporary(Context, None)
    MDNode *N = MDNode::get(Context, {T});
    T->replaceAllUsesWith(N);
    MDNode::deleteTemporary(T);

to this:

    // In that patch it's "getNonUniqued()".
    MDNode *N = MDNode::getDistinct(Context, None);

Furthermore, if all references to the alias root got dropped, they'd
clean themselves up instead of keeping each other alive in a cycle.

This sounds like a great improvement!

-Hal

>
>> From: "Duncan P. N. Exon Smith" <dexonsmith@apple.com>
>> To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
>> Sent: Sunday, November 9, 2014 7:02:43 PM
>> Subject: [LLVMdev] [RFC] Separating Metadata from the Value hierarchy
>>
>>
>>
>> TL;DR: If you use metadata (especially if it's out-of-tree), check
>> the
>> numbered list of lost functionality below to see whether I'm trying
>> to
>> break your compiler permanently.
>>
>> In response to my recent commits (e.g., [1]) that changed API from
>> `MDNode` to `Value`, Eric had a really interesting idea [2] -- split
>> metadata entirely from the `Value` hierarchy, and drop general
>> support
>> for metadata RAUW.
>>
>> After hacking together some example code, this seems overwhelmingly
>> to
>> me like the right direction. See the attached metadata-v2.patch for
>> my
>> sketch of what the current metadata primitives might look like in the
>> new hierarchy (includes LLVMContextImpl uniquing support).
>>
>> The initial motivation was to avoid the API weaknesses caused by
>> having
>> non-`MDNode` metadata that inherits from `Value`. In particular,
>> instead of changing API from `MDNode` to `Value`, change it to a base
>> class called `Metadata`, which sheds the underused and expensive
>> `Value`
>> base class entirely.
>>
>> The implications are broader: an enormous reduction in complexity
>> (and
>> overhead) for metadata.
>>
>> This change implies minor (major?) loss of functionality for
>> metadata,
>> but Eric and I suspect that the only hard-to-fix user is debug info
>> itself, whose IR infrastructure I'm rewriting anyway.
>>
>> Here is what we'd lose:
>>
>> 1. No more arbitrary RAUW of metadata.
>>
>> While we'd keep support for RAUW of temporary MDNodes for use as
>> forward declarations (when parsing assembly or constructing cyclic
>> graphs), drop RAUW support for all other metadata.
>
> So temporary MDNodes would be Values, but regular metadata would not be?
Will regular metadata nodes no longer have lists of users? If I have a
TrackingVH<MDNode> with temporary MDNodes, after I call replaceAllUsesWith,
what happens?
>
> I'm specifically wondering how we'll need to update
CloneAliasScopeMetadata in lib/Transforms/Utils/InlineFunction.cpp, and I
don't see any fundamental problem with what you've proposed, but these seem
like generic upgrade questions.
>
> Thanks again,
> Hal

So, none of them would be `Value`s.

`TempMDNode` supports its own, non-`Value`-based, RAUW via
`ReplaceableMetadataImpl` (`ValueAsMetadata` uses the same).

`CloneAliasScopeMetadata()` should use `TrackingMDRef` in place of
`TrackingVH<MDNode>`. `TrackingMDRef` will register itself with the
`Metadata` if it's replaceable, and if/when it gets RAUW'ed, the pointer
will get updated.

If you have another look at the patch it might be more clear now.

BTW, another thing I added in that sketch was the option to explicitly
request a non-uniqued `MDNode`. Haven't thought through details, but I
was specifically thinking this would be a cleaner way to create your
non-uniquable alias nodes (instead of the current self-referential
approach).

My working straw man for assembly syntax looks like this:

    !1 = metadata distinct !{ ... }

As an example, the alias "root" nodes could change from this:

    !1 = metadata !{ metadata !1 }

to this:

    !1 = metadata distinct !{}

Which means constructing them would change from this:

    MDNode *T = MDNode::getTemporary(Context, None)
    MDNode *N = MDNode::get(Context, {T});
    T->replaceAllUsesWith(N);
    MDNode::deleteTemporary(T);

to this:

    // In that patch it's "getNonUniqued()".
    MDNode *N = MDNode::getDistinct(Context, None);

Furthermore, if all references to the alias root got dropped, they'd
clean themselves up instead of keeping each other alive in a cycle.

There are at least two bugs I know of in debug info that are due to
overzealous metadata uniqing:

1) inline two calls to the same function on the same line and we end up
with one inlined call (inlined location + original location are the same,
so the scope becomes one scope instead of two). We workaround this in the
frontend by adding column info to calls (but we don't do it for constructor
calls or member function calls... ) but any workaround is insufficient as,
in the limit, you can put the two calls in a macro and then they'll be
attributed to the same line and column. (I cc'd you on a bug about this a
week or two ago)

2) two anonymous structs in the same location (again, even adding column
info doesn't save you from a macro) end up as one struct... (which gets
weird with duplicate members, etc)

There could easily be others, too.

Why not just make the Debug classes totally first class, i.e., the root of their own hierarchy? Given that you’re removing local metadata, fixing up dbgvalue, and debug locations are already stored on instructions in a special way, I don’t see why we need to even represent debug info as metadata at all.

Debug info is already being treated specially in parsing, printing, storage, etc. Seems like having Module contain a pointer to its CompileUnit(s) would move a huge amount of this stuff completely out of metadata. Then you can still have your own rules for uniquing if you want them.

Pete

The class hierarchy I envision looks something like this:

   Metadata
     MDNode
       TempMDNode // MDNodeFwdRef?
       UniquableMDNode // GenericMDNode?
       DINode // Is this layer useful?
         DILocation
         DIScope
           DIType
             DIBasicType
             DICompositeType
             ...
           DISubprogram
           ...
         DICompileUnit
         ...
     MDString
     ValueAsMetadata

Why not just make the Debug classes totally first class, i.e., the root of their own hierarchy? Given that you’re removing local metadata, fixing up dbgvalue, and debug locations are already stored on instructions in a special way, I don’t see why we need to even represent debug info as metadata at all.

Two reasons jump out:

1. It's serving as a nice forcing function for cleaning up the rest
    of metadata.

2. I suspect the rules for "how to deal with debug info" would look
    pretty similar to those for "how to deal with metadata".

Debug info is already being treated specially in parsing, printing, storage, etc. Seems like having Module contain a pointer to its CompileUnit(s) would move a huge amount of this stuff completely out of metadata. Then you can still have your own rules for uniquing if you want them.

Yup, that's the plan, except for the "out of metadata" part. Once
the hierarchy is fixed I'll create custom types for debug info,
whose ownership (and uniquing where appropriate) I'll customize. I
agree that a module should own its compile unit(s) directly.

If there's a compelling reason to separate debug info entirely, the
planned work is IMO the right incremental step to get there. We
need to customize the ownership first. But cleaning up metadata and
customizing debug info ownership (etc.) could be sufficient.

The class hierarchy I envision looks something like this:

Metadata
MDNode
TempMDNode // MDNodeFwdRef?
UniquableMDNode // GenericMDNode?
DINode // Is this layer useful?
DILocation
DIScope
DIType
DIBasicType
DICompositeType

DISubprogram

DICompileUnit

MDString
ValueAsMetadata

Why not just make the Debug classes totally first class, i.e., the root of their own hierarchy? Given that you’re removing local metadata, fixing up dbgvalue, and debug locations are already stored on instructions in a special way, I don’t see why we need to even represent debug info as metadata at all.

Because it fits the general idea of metadata. The same ideas behind how to deal with debug info is the same as how to deal with metadata.

Debug info is already being treated specially in parsing, printing, storage, etc. Seems like having Module contain a pointer to its CompileUnit(s) would move a huge amount of this stuff completely out of metadata. Then you can still have your own rules for uniquing if you want them.

I disagree with this in that I don’t think it’s a necessary part of the design. With these changes there’s no compelling argument to separate out debug info at all.

-eric

I’m not following your algorithm intentions well yet, but this seems like a really heavy-weight implementation, given that this is a common base class for a number of other important things.

Unfortunately I think this weight will be hard to optimize out (although
I’m sure it could be a little smaller).

In the Value hierarchy, you pay memory for RAUW at the site of each
Use. This makes sense, since most values can be RAUW’ed.

Since very little metadata needs to be RAUW’ed, we don’t want to pay
memory at every use-site. Instead, we pay the cost inside the
(hopefully) few instances that support RAUW.

Agreed. I don’t see us needing to RAUW debug info outside of llvm-link and the forward references. For other metadata uses RAUW doesn’t seem to be used at all (at least a quick glance didn’t show TempMDNodes anywhere).

The core assumption is that there are relatively few replaceable
metadata instances. The only subclasses are ValueAsMetadata and
TempMDNode. How many of these are live?

  • There will be at most one ValueAsMetadata instance for each
    metadata operand that points at a Value. My data from a couple of
    months ago showed that there are very few of these.

  • TempMDNodes are used as forward references. You only pay their
    cost until the forward reference gets resolved (i.e., deleted).

The main case I’m worried about here are references to ConstantInts,
which are used pretty heavily outside of debug info (e.g., in !prof
attachments). However, if that shows up in a profile, we can bypass
Values entirely by creating a new MDIntArray subclass (or
something).

This makes sense.

The class hierarchy I envision looks something like this:

Metadata
MDNode
TempMDNode // MDNodeFwdRef?
UniquableMDNode // GenericMDNode?
DINode // Is this layer useful?
DILocation
DIScope
DIType
DIBasicType
DICompositeType

DISubprogram

DICompileUnit

FWIW technically a compile unit is a scope so would probably need to be under that. Otherwise I’m missing a change rationale (I haven’t read the full patch yet, I apologize if it’s clear from the patch).


MDString
ValueAsMetadata

UniquableMDNode is a leaf-class that behaves like the current MDNode
(when it’s not a temporary). I called it “uniquable” because, unlike
TempMDNode, it is uniqued by default (although you can opt-out of it,
and uniquing might be dropped).

Maybe a better name is GenericMDNode?

Also, off-topic, but after sketching out the imagined hierarchy above,
I’m not sure DINode is particularly useful.

Possibly? Might be from the “organizing all of the metadata types” level?

-eric

The class hierarchy I envision looks something like this:

Metadata
MDNode
TempMDNode // MDNodeFwdRef?
UniquableMDNode // GenericMDNode?
DINode // Is this layer useful?
DILocation
DIScope
DIType
DIBasicType
DICompositeType

DISubprogram

DICompileUnit

MDString
ValueAsMetadata

Why not just make the Debug classes totally first class, i.e., the root of their own hierarchy? Given that you’re removing local metadata, fixing up dbgvalue, and debug locations are already stored on instructions in a special way, I don’t see why we need to even represent debug info as metadata at all.

Two reasons jump out:

  1. It’s serving as a nice forcing function for cleaning up the rest
    of metadata.

  2. I suspect the rules for “how to deal with debug info” would look
    pretty similar to those for "how to deal with metadata”.

Good point.

Debug info is already being treated specially in parsing, printing, storage, etc. Seems like having Module contain a pointer to its CompileUnit(s) would move a huge amount of this stuff completely out of metadata. Then you can still have your own rules for uniquing if you want them.

Yup, that’s the plan, except for the “out of metadata” part. Once
the hierarchy is fixed I’ll create custom types for debug info,
whose ownership (and uniquing where appropriate) I’ll customize. I
agree that a module should own its compile unit(s) directly.

Sounds good.

If there’s a compelling reason to separate debug info entirely, the
planned work is IMO the right incremental step to get there. We
need to customize the ownership first. But cleaning up metadata and
customizing debug info ownership (etc.) could be sufficient.

Totally agree with this approach. Whether or not separating debug info out from metadata is a good idea can definitely wait til later, and may not be needed at all.

Cheers,
Pete

My mistake, just thinking out loud. (None of this debug info stuff is
in the patch, it'll just come out naturally when I get back to the
debug info proposals.)

I generally support this direction. I do not know of any current use case which would be broken by this change. I do have some hesitations about potential future use though. (see below) I hesitate a bit here. The current range metadata deals only with constant ranges, but I could see a day where encoding a range in terms of two other Values might be useful. This is one obvious use case, but there may also be others. I’m not going to oppose the proposal made, but if there was a way to preserve the ability to reference to function local SSA values, I’d slightly prefer that.

In theory it's possible, but it adds a large complexity burden
because of the need to track whether metadata is local or global.
I strongly prefer dropping it until a compelling use-case appears.

Note that debug info is rather special in that it's *not allowed*
to modify optimizations or code generation. For non-debug info,
you can reference local values directly using intrinsics, as (e.g.)
@llvm.assume [1] does.

[1]: http://llvm.org/docs/LangRef.html#llvm-assume-intrinsic

If you don't care about function-local metadata and debug info
intrinsics, skip ahead to the section on assembly syntax in case you
have comments on that.

If you don't care about function-local metadata and debug info
intrinsics, skip ahead to the section on assembly syntax in case you
have comments on that.

2. No more function-local metadata.

  AFAICT, function-local metadata is *only* used for indirect
  references to instructions and arguments in `@llvm.dbg.value` and
  `@llvm.dbg.declare` intrinsics. The first argument of the following
  is an example:

      call void @llvm.dbg.value(metadata !{i32 %val}, metadata !0,
                                metadata !1)

  Note that the debug info people uniformly seem to dislike the status
  quo, since it's awkward to get from a `Value` to the corresponding
  intrinsic.

  Upgrade path: Instead of using an intrinsic that references a
  function-local value through an `MDNode`, attach metadata to the
  corresponding argument or instruction, or to the terminating
  instruction of the basic block. (This requires new support for
  attaching metadata to function arguments, which I'll have to add for
  debug info anyway.)

llvm::Argument attachments are hard

I've been looking at prototyping metadata attachments to
`llvm::Argument`, which is key to replacing debug info intrinsics.

It's a fairly big piece of new IR, and comes with its own subtle
semantic decisions. What do you do with metadata attached to arguments
when you inline a function? If the arguments are remapped to other
instructions (or arguments), they may have metadata of the same kind
attached. Does one win? Which one? Or are they merged? What if the
arguments get remapped to constants? What about when a function is
cloned?

While the rest of this metadata-is-not-a-value proposal is effectively
NFC, this `Argument` part could introduce problems. If I rewrite debug
info intrinsics as argument attachments and then immediately split
`Metadata` from `Value`, any semantic subtleties will be difficult to
diagnose in the noise of the rest of the changes.

While I was looking at this as a shortcut to avoid porting
function-local metadata, I think it introduces more points of failure
than problems it solves.

One thing to consider is that there are cases were we describe function arguments without referencing the argument in the intrinsic at all. Currently, if you compile

  $ cat struct.c
  struct s { int a; int b; };
  int foo(struct s s1) { return s1.a; }

  $ clang -g -O1 -arch x86_64 -S -emit-llvm struct.c

we cannot preserve the debug info for the argument so it is turned into an intrinsic describing an undef value.

  ; Function Attrs: nounwind readnone ssp uwtable
  define i32 @foo(i64 %s1.coerce) #0 {
  entry:
    %s1.sroa.0.0.extract.trunc = trunc i64 %s1.coerce to i32
    tail call void @llvm.dbg.declare(metadata !18, metadata !14, metadata !19), !dbg !20
    ret i32 %s1.sroa.0.0.extract.trunc, !dbg !21
  }

  !18 = metadata !{%struct.s* undef}

Note that it is critical for this DIVariable to make it into the debug info even if it is undefined, or the function arguments won’t match the function signature.

-- adrian

>
> If you don't care about function-local metadata and debug info
> intrinsics, skip ahead to the section on assembly syntax in case you
> have comments on that.
>
>>
>> 2. No more function-local metadata.
>>
>> AFAICT, function-local metadata is *only* used for indirect
>> references to instructions and arguments in `@llvm.dbg.value` and
>> `@llvm.dbg.declare` intrinsics. The first argument of the following
>> is an example:
>>
>> call void @llvm.dbg.value(metadata !{i32 %val}, metadata !0,
>> metadata !1)
>>
>> Note that the debug info people uniformly seem to dislike the status
>> quo, since it's awkward to get from a `Value` to the corresponding
>> intrinsic.
>>
>> Upgrade path: Instead of using an intrinsic that references a
>> function-local value through an `MDNode`, attach metadata to the
>> corresponding argument or instruction, or to the terminating
>> instruction of the basic block. (This requires new support for
>> attaching metadata to function arguments, which I'll have to add for
>> debug info anyway.)
>
> llvm::Argument attachments are hard
> ===================================
>
> I've been looking at prototyping metadata attachments to
> `llvm::Argument`, which is key to replacing debug info intrinsics.
>
> It's a fairly big piece of new IR, and comes with its own subtle
> semantic decisions. What do you do with metadata attached to arguments
> when you inline a function? If the arguments are remapped to other
> instructions (or arguments), they may have metadata of the same kind
> attached. Does one win? Which one? Or are they merged? What if the
> arguments get remapped to constants? What about when a function is
> cloned?
>
> While the rest of this metadata-is-not-a-value proposal is effectively
> NFC, this `Argument` part could introduce problems. If I rewrite debug
> info intrinsics as argument attachments and then immediately split
> `Metadata` from `Value`, any semantic subtleties will be difficult to
> diagnose in the noise of the rest of the changes.
>
> While I was looking at this as a shortcut to avoid porting
> function-local metadata, I think it introduces more points of failure
> than problems it solves.

One thing to consider is that there are cases were we describe function
arguments without referencing the argument in the intrinsic at all.
Currently, if you compile

  $ cat struct.c
  struct s { int a; int b; };
  int foo(struct s s1) { return s1.a; }

  $ clang -g -O1 -arch x86_64 -S -emit-llvm struct.c

we cannot preserve the debug info for the argument so it is turned into an
intrinsic describing an undef value.

  ; Function Attrs: nounwind readnone ssp uwtable
  define i32 @foo(i64 %s1.coerce) #0 {
  entry:
    %s1.sroa.0.0.extract.trunc = trunc i64 %s1.coerce to i32
    tail call void @llvm.dbg.declare(metadata !18, metadata !14, metadata
!19), !dbg !20
    ret i32 %s1.sroa.0.0.extract.trunc, !dbg !21
  }

  !18 = metadata !{%struct.s* undef}

I'm a little confused by this example - ideally in this case we wouldn't
lose the variable, it's right there and we should reference it. But, yes,
that'll take your SROA patch, etc.

Note that it is critical for this DIVariable to make it into the debug
info even if it is undefined, or the function arguments won’t match the
function signature.

In theory, at -O0 we should never "not have" an argument to attach
something to, right? (that'd require DAE to run) and above -O0 we store the
variable list so we can't lose variables anyway.

Should we just aim for that? How far off are we before that's a practical
goal (so we can avoid having to add in an intrinsic for this special case
that we plan/hope/know will go away eventually). Alternatively, should we
just turn on the variable list even at -O0, I know some other bugs that
would 'fix' (fix in the sense of at least providing the right signature -
but still be cases of missing variable location info).

- David