Questions before moving the new debug info hierarchy into place

I'm getting close to executing the transition to the new debug info
hierarchy. For reference, I've attached two WIP patches (which would be
squashed before commit) and the WIP upgrade script I'm using.

    - transition-code.patch: Change the `DIDescriptor` hierarchy to
      lightweight wrappers around the subclasses of `DebugNode` (instead
      of rather heavier wrappers around `MDTuple`), and update
      `DIBuilder` to generate the right things.
    - transition-testcases.patch: Upgrade of testcases, entirely
      generated from the upgrade script (so far).
    - upgrade-specialized-nodes.sh: Script to upgrade LLVM assembly.

(Feel free to have a look, but I haven't updated LangRef, I don't quite
have `make check` passing, and I haven't even started on `make
check-clang` (I imagine that'll all be done by hand).)

There are two outstanding issues I'd like some feedback on.

Pretty-printing the flags

transition-code.patch (325 Bytes)

transition-testcases.patch (1.67 MB)

upgrade-specialized-nodes.sh (48.5 KB)

I'm getting close to executing the transition to the new debug info
hierarchy. For reference, I've attached two WIP patches (which would be
squashed before commit) and the WIP upgrade script I'm using.

   - transition-code.patch: Change the `DIDescriptor` hierarchy to
     lightweight wrappers around the subclasses of `DebugNode` (instead
     of rather heavier wrappers around `MDTuple`), and update
     `DIBuilder` to generate the right things.
   - transition-testcases.patch: Upgrade of testcases, entirely
     generated from the upgrade script (so far).
   - upgrade-specialized-nodes.sh: Script to upgrade LLVM assembly.

(Feel free to have a look, but I haven't updated LangRef, I don't quite
have `make check` passing, and I haven't even started on `make
check-clang` (I imagine that'll all be done by hand).)

There are two outstanding issues I'd like some feedback on.

Pretty-printing the flags

I've noticed the `flags:` field is *harder* to read in the new assembly:

   !MDDerivedType(flags: 16384, ...)

than the pretty-printed comments in the old:

   !{!"...\\0016384", ...} ; ... [public] [rvalue reference]

I don't want to regress here.

In `DIDescriptor`, the flags are described in an enum bitfield:

   FlagAccessibility = 1 << 0 | 1 << 1,
   FlagPrivate = 1,
   FlagProtected = 2,
   FlagPublic = 3,
   FlagFwdDecl = 1 << 2,
   FlagAppleBlock = 1 << 3,
   FlagBlockByrefStruct = 1 << 4,
   FlagVirtual = 1 << 5,
   FlagArtificial = 1 << 6,
   FlagExplicit = 1 << 7,
   FlagPrototyped = 1 << 8,
   FlagObjcClassComplete = 1 << 9,
   FlagObjectPointer = 1 << 10,
   FlagVector = 1 << 11,
   FlagStaticMember = 1 << 12,
   FlagLValueReference = 1 << 13,
   FlagRValueReference = 1 << 14

I think the right short-term solution is to use these names directly in
assembly, giving us:

   !MDDerivedType(flags: FlagPublic | FlagRValueReference, ...)

This is easy to implement and easy to CHECK against.

Sound good?

(Eventually, I'd like to use the `DW_AT` symbols that each of these
corresponds to, but `FlagStaticMember` doesn't seem to line up with any
such `DW_AT` so that will take some refactoring (and I don't think it
makes sense for that to block moving the hierarchy into place).)

I support this proposal under the condition that it is only an intermediate step.

What I would like to evolve this to is a more generic infrastructure were we have a list of “trivial" attributes (= attributes that don’t need any special backend support) so we could write things like
   MDCompositeType(attributes: [DW_AT_artificial, DW_AT_rvalue_reference, DW_AT_public], ...)

perhaps even
   MDCompositeType(attributes: [DW_AT_vtable_elem_location(DIExpression(DW_OP_constu(3)))], ...)
?

[ Looks nice, but it isn’t perfect, of course. For example: DW_AT_rvalue_reference is technically only available in DWARF5. Should the frontend then have to know/care about these details, or should the backend perform a lowering from a DWARF5-based IR to whatever the module flags requested? ]

-- adrian

I agree with Adrian here, and SGTM.

Thanks for all of your very hard work Duncan!

-eric

I'm getting close to executing the transition to the new debug info
hierarchy. For reference, I've attached two WIP patches (which would be
squashed before commit) and the WIP upgrade script I'm using.

    - transition-code.patch: Change the `DIDescriptor` hierarchy to
      lightweight wrappers around the subclasses of `DebugNode` (instead
      of rather heavier wrappers around `MDTuple`), and update
      `DIBuilder` to generate the right things.
    - transition-testcases.patch: Upgrade of testcases, entirely
      generated from the upgrade script (so far).
    - upgrade-specialized-nodes.sh: Script to upgrade LLVM assembly.

(Feel free to have a look, but I haven't updated LangRef, I don't quite
have `make check` passing, and I haven't even started on `make
check-clang` (I imagine that'll all be done by hand).)

There are two outstanding issues I'd like some feedback on.

Pretty-printing the flags

I've noticed the `flags:` field is *harder* to read in the new assembly:

    !MDDerivedType(flags: 16384, ...)

than the pretty-printed comments in the old:

    !{!"...\\0016384", ...} ; ... [public] [rvalue reference]

I don't want to regress here.

In `DIDescriptor`, the flags are described in an enum bitfield:

    FlagAccessibility = 1 << 0 | 1 << 1,
    FlagPrivate = 1,
    FlagProtected = 2,
    FlagPublic = 3,
    FlagFwdDecl = 1 << 2,
    FlagAppleBlock = 1 << 3,
    FlagBlockByrefStruct = 1 << 4,
    FlagVirtual = 1 << 5,
    FlagArtificial = 1 << 6,
    FlagExplicit = 1 << 7,
    FlagPrototyped = 1 << 8,
    FlagObjcClassComplete = 1 << 9,
    FlagObjectPointer = 1 << 10,
    FlagVector = 1 << 11,
    FlagStaticMember = 1 << 12,
    FlagLValueReference = 1 << 13,
    FlagRValueReference = 1 << 14

I think the right short-term solution is to use these names directly in
assembly, giving us:

    !MDDerivedType(flags: FlagPublic | FlagRValueReference, ...)

This is easy to implement and easy to CHECK against.

Sound good?

(Eventually, I'd like to use the `DW_AT` symbols that each of these
corresponds to, but `FlagStaticMember` doesn't seem to line up with any
such `DW_AT` so that will take some refactoring (and I don't think it
makes sense for that to block moving the hierarchy into place).)

Merging the two types of files

In the old format, we have two types of files -- an untagged file node,
and a DIFile/DW_TAG_file_type/0x29 which references the untagged node.

    !0 = !{!"path/to/file", !"/path/to/dir"}
    !1 = !{!"0x29", !0}

In the actual metadata graph, most file references use !0, but in
DIBuilder !1 is always passed in and the !0 is extracted from it.

I've been planning to merge these into:

    !1 = !MDFile(filename: "path/to/file", directory: "/path/to/dir")

Anyone see a problem with that? Why?

If the strings are deduplicated elsewhere, that should be fine. (I think I
made the change originally to pull out the names because of the duplication
I saw in many constructsn needing to reference the file/directory and they
all contained the same file/directory)

I take it you'll have other constructs (I think all scopes have a
file/directory) reference the MDFile directly?

No specific comments on the approach which seems fine. One question though:

Speaking of deduplication of filenames. I think we discussed that in the early stages of your work, but I just wanted to make sure I remember correctly: the new debug hierarchy will allow to implement specific uniquing behavior, right? So we will be able to tweak MDFile to unique:

!MDFile(filename: “path/to/file”, directory: “/path/to/dir”)
!MDFile(filename: “to/file”, directory: “/path/to/dir/path”)

into the same object? I did some work in this area and in the current form it’s not really possible to do cleanly (don’t remember the details here, but I know punted it till we get smart uniquing capability).

Thanks!
Fred

Okay, thanks Adrian and Eric. I'll move forward with the Flag* constants.

I'm getting close to executing the transition to the new debug info
hierarchy. For reference, I've attached two WIP patches (which would be
squashed before commit) and the WIP upgrade script I'm using.

    - transition-code.patch: Change the `DIDescriptor` hierarchy to
      lightweight wrappers around the subclasses of `DebugNode` (instead
      of rather heavier wrappers around `MDTuple`), and update
      `DIBuilder` to generate the right things.
    - transition-testcases.patch: Upgrade of testcases, entirely
      generated from the upgrade script (so far).
    - upgrade-specialized-nodes.sh: Script to upgrade LLVM assembly.

(Feel free to have a look, but I haven't updated LangRef, I don't quite
have `make check` passing, and I haven't even started on `make
check-clang` (I imagine that'll all be done by hand).)

There are two outstanding issues I'd like some feedback on.

Pretty-printing the flags

I've noticed the `flags:` field is *harder* to read in the new assembly:

    !MDDerivedType(flags: 16384, ...)

than the pretty-printed comments in the old:

    !{!"...\\0016384", ...} ; ... [public] [rvalue reference]

I don't want to regress here.

In `DIDescriptor`, the flags are described in an enum bitfield:

    FlagAccessibility = 1 << 0 | 1 << 1,
    FlagPrivate = 1,
    FlagProtected = 2,
    FlagPublic = 3,
    FlagFwdDecl = 1 << 2,
    FlagAppleBlock = 1 << 3,
    FlagBlockByrefStruct = 1 << 4,
    FlagVirtual = 1 << 5,
    FlagArtificial = 1 << 6,
    FlagExplicit = 1 << 7,
    FlagPrototyped = 1 << 8,
    FlagObjcClassComplete = 1 << 9,
    FlagObjectPointer = 1 << 10,
    FlagVector = 1 << 11,
    FlagStaticMember = 1 << 12,
    FlagLValueReference = 1 << 13,
    FlagRValueReference = 1 << 14

I think the right short-term solution is to use these names directly in
assembly, giving us:

    !MDDerivedType(flags: FlagPublic | FlagRValueReference, ...)

This is easy to implement and easy to CHECK against.

Sound good?

(Eventually, I'd like to use the `DW_AT` symbols that each of these
corresponds to, but `FlagStaticMember` doesn't seem to line up with any
such `DW_AT` so that will take some refactoring (and I don't think it
makes sense for that to block moving the hierarchy into place).)

Merging the two types of files

In the old format, we have two types of files -- an untagged file node,
and a DIFile/DW_TAG_file_type/0x29 which references the untagged node.

    !0 = !{!"path/to/file", !"/path/to/dir"}
    !1 = !{!"0x29", !0}

In the actual metadata graph, most file references use !0, but in
DIBuilder !1 is always passed in and the !0 is extracted from it.

I've been planning to merge these into:

    !1 = !MDFile(filename: "path/to/file", directory: "/path/to/dir")

Anyone see a problem with that? Why?

If the strings are deduplicated elsewhere, that should be fine. (I think I made the change originally to pull out the names because of the duplication I saw in many constructsn needing to reference the file/directory and they all contained the same file/directory)

Yup, they'll be de-duped via `MDFile`.

I take it you'll have other constructs (I think all scopes have a file/directory) reference the MDFile directly?

I'll make it so all `file:` references point to an `MDFile`; so
far this is only true for `MDCompileUnit` (the others point to
the untagged node). There are some places (at least in
testcases) where `MDFile` is being used in a `scope:` reference;
I'll leave those unchanged (although IIUC, Adrian thinks a file
should never be used as a scope -- you always have a better
option (compile unit, namespace, etc.) -- so maybe that'll be
changing eventually... we just need be mindful of type uniquing).

IIRC, there was some talk about how the DWARF spec requires that
references to a `directory` need to be the CWD of the compiler.

It's not clear to me *which* references need that, but I guess
we need to sort out exactly what the requirements are so we know
which way to canonicalize things. I wonder if this only matters
for DW_TAG_compile_unit? If so, we could canonicalize freely
everywhere else. (For example, we could store a `path:` node
everywhere (dropping the filename/directory distinction, or
canonicalizing it to basename/dirname, etc.), and add a `cwd:`
to the compile unit.)

David also pointed out (in a previous thread about this) that
the frontend (or `DIBuilder`) might be the right location for
canonicalization. Certainly, we couldn't canonicalize `..`
references without access to the filesystem (at least not on
POSIX platforms), but maybe we don't care about that anyway?

Speaking of deduplication of filenames. I think we discussed that in the early stages of your work, but I just wanted to make sure I remember correctly: the new debug hierarchy will allow to implement specific uniquing behavior, right? So we will be able to tweak MDFile to unique:

!MDFile(filename: “path/to/file”, directory: “/path/to/dir”)
!MDFile(filename: “to/file”, directory: “/path/to/dir/path”)

into the same object? I did some work in this area and in the current form it’s not really possible to do cleanly (don’t remember the details here, but I know punted it till we get smart uniquing capability).

IIRC, there was some talk about how the DWARF spec requires that
references to a directory need to be the CWD of the compiler.

If you refer to a conversation we had, I’m pretty sure I was referring to this text in the standard about the directories listed in the line table:

“Entries in this sequence describe each path that was searched for included source files in this compilation. (The paths include those directories specified explicitly by the user for the compiler to search and those the compiler searches without explicit direction.) Each path entry is either a full path name or is relative to the current directory of the compilation.”

I think that at the time I had interpreted that has “if the user passes -I…/include to the compiler, le line table should list …/include as a directory in the line table”. I think my interpretation was too strict though and that we could canonicalize the file/dir names as we’d like.

The compilation dir will always be there as a special entry. It’s the implicit entry 0 in the directory list, but it’s value is not stored in the line table, it is a normal string that is referenced by the TAG_compile_unit DIE. If it is extracted from an MDFile, we need to make sure this one doesn’t get mangled, but I think this is the only one we really need to preserve.

It’s not clear to me which references need that, but I guess
we need to sort out exactly what the requirements are so we know
which way to canonicalize things. I wonder if this only matters
for DW_TAG_compile_unit? If so, we could canonicalize freely
everywhere else. (For example, we could store a path: node
everywhere (dropping the filename/directory distinction, or
canonicalizing it to basename/dirname, etc.), and add a cwd:
to the compile unit.)

As stated above, I think we are free to canonicalize as we wish.

David also pointed out (in a previous thread about this) that
the frontend (or DIBuilder) might be the right location for
canonicalization. Certainly, we couldn’t canonicalize ..
references without access to the filesystem (at least not on
POSIX platforms), but maybe we don’t care about that anyway?

What I attempted to do in the past was exactly that, canonicalize ‘…’ in the paths (The lldb folks would have loved to be able to consider file identifiers as unique in the line table).

Fred

David also pointed out (in a previous thread about this) that
the frontend (or `DIBuilder`) might be the right location for
canonicalization. Certainly, we couldn't canonicalize `..`
references without access to the filesystem (at least not on
POSIX platforms), but maybe we don't care about that anyway?

What I attempted to do in the past was exactly that, canonicalize ‘..’ in the paths

On POSIX you need access to the filesystem to canonicalize
`..` (if `./dir` is a symlink, then `./dir/../` does not
necessarily equal `./`), so any such canonicalization should
IMO be done outside of `MDFile`. But yeah, we should revisit
this soon.

(The lldb folks would have loved to be able to consider file identifiers as unique in the line table).

This further requires resolving *all* symlinks in the path.
Certainly doable... but it's not just string manipulation.

Yeah, both problems are solved by a call to realpath. It’s quite heavy though as it has to at least stat every component of the path. If we do that we’ll need to make sure it’s not too costly, or resort to some sort of caching.

Committed the changes to `MDFile` in r230057 (`DIFile` and testcase
upgrading will be rolled into the "transition" patch).

Okay, thanks Adrian and Eric. I'll move forward with the Flag* constants.

Committed in r230095, r230100, r230105, r230107, r230108, and r230111.