debugloc metadata variation

Just working on some of the gmlt+fission debug info stuff and I came across a comment that might be relevant to reducing the number of distinct debugloc metadata nodes:

“or some sub-optimal metadata that
// isn’t structurally identical (see: file path/name info from clang, which
// includes the directory of the cpp file being built, even when the file name
// is absolute (such as an <> lookup header)))”

Seems that the file path/name isn’t well canonicalized so as to allow metadata level merging when linking. Might be helpful to figure out that issue at some point.

  • David

Just working on some of the gmlt+fission debug info stuff and I came
across a comment that might be relevant to reducing the number of distinct
debugloc metadata nodes:

"or some sub-optimal metadata that
  // isn't structurally identical (see: file path/name info from clang,
which
  // includes the directory of the cpp file being built, even when the
file name
  // is absolute (such as an <> lookup header)))"

Seems that the file path/name isn't well canonicalized so as to allow
metadata level merging when linking. Might be helpful to figure out that
issue at some point.

I wonder how much memory is spent on redundant path prefixes. E.g.

/really/really/really/long/path/to/foo.h
/really/really/really/long/path/to/bar.h

-- Sean Silva

Just working on some of the gmlt+fission debug info stuff and I came across a comment that might be relevant to reducing the number of distinct debugloc metadata nodes:

"or some sub-optimal metadata that
  // isn't structurally identical (see: file path/name info from clang, which
  // includes the directory of the cpp file being built, even when the file name
  // is absolute (such as an <> lookup header)))"

Seems that the file path/name isn't well canonicalized so as to allow metadata level merging when linking. Might be helpful to figure out that issue at some point.

Incidentally I worked on an issue last week where the line table would get entries representing the same file, but where the file/dir split wasn’t done at the same place. I have a patch that remerges them at emission, but I was planing on investigating more the source of the duplication before I submit anything.

The cases I’ve seen have one duplicated entry though, nothing that could have a visible impact on memory consumption.

Fred

(sorry for the duplicate Fred, I failed at reply-all the first time)

This might be fixed by making `MDFile` (or `DIFile`) first-class. We
just need to canonicalize on creation.

    class MDFile {
    public:
      // Split the path at the right place.
      MDFile *get(LLVMContext &C, StringRef Path);

      // Convenience for callers, but the path gets canonicalized.
      MDFile *get(LLVMContext &C, StringRef File, StringRef Dir);

      StringRef getFilename() const;
      StringRef getDirectory() const;
    };

Note that whether we continue to use `MDString` under the hood is an
implementation detail.

However, path canonicalization (in particular, eating "..") requires
a `stat()` to do correctly on *NIX, so the implementation would have
to cache lookups. Doesn't seem hard though.

Seems mostly orthogonal to MDFile being first class - I imagine the logic
would be implemented just as well in DIBuilder::createFile and moved around
wherever it happens to go when MDFile/DIFile becomes first class. (& it'll
have to be done on construction/during frontend work (preaching to the
choir, but just saying it to make sure I'm understanding, etc) because the
files might not be around to run 'stat' on by the time we're processing IR
- doesn't mean the code for it has to be written in the frontend, it can
still be in a utility like DIBuilder, so long as we never have to conjure
up new DIFiles during optimizations (or if we do so, they don't need to be
canonicalized, which sounds plausible))

- David

>
> (sorry for the duplicate Fred, I failed at reply-all the first time)
>
>
> >
> > Just working on some of the gmlt+fission debug info stuff and I came across a comment that might be relevant to reducing the number of distinct debugloc metadata nodes:
> >
> > "or some sub-optimal metadata that
> > // isn't structurally identical (see: file path/name info from clang, which
> > // includes the directory of the cpp file being built, even when the file name
> > // is absolute (such as an <> lookup header)))"
> >
> > Seems that the file path/name isn't well canonicalized so as to allow metadata level merging when linking. Might be helpful to figure out that issue at some point.
>
> Incidentally I worked on an issue last week where the line table would get entries representing the same file, but where the file/dir split wasn’t done at the same place. I have a patch that remerges them at emission, but I was planing on investigating more the source of the duplication before I submit anything.
>
> The cases I’ve seen have one duplicated entry though, nothing that could have a visible impact on memory consumption.
>
> So the particular case where I think this arises in a way that might be measurable is if you have a build system that changes directories to build subprojects (like our make build system, if I understand correctly - but not our cmake build system, again, if I understand correctly):
>
> imagine a simple directory layout:
>
> include/
> foo.h
> lib/
> a/
> a.cpp // includes foo.h and calls one inline function from it (or uses a type, etc) from some external function a()
> b/
> b.cpp // does the same thing as a.cpp, but with its own external function, b()
>
> if you run "clang++ -emit-llvm -S -Iinclude -c lib/a/a.cpp lib/b/b.cpp -g" you get two .ll files both with the obvious:
> !9 = metadata !{metadata !"include/foo.h", metadata !"/tmp/dbginfo/pathtest"}
> But if you do this instead: "cd lib/a; clang++ -emit-llvm -S -I../../include -c a.cpp -g; cd ../../lib/b; clang++ -emit-llvm -S -I../../include -c b.cpp -g" you get two different nodes:
>
> !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/b"}
>
> !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/a"}
> and now you're left with a situation in which almost all the metadata is different and any place you were relying on the standard metadata uniquing you won't get it :frowning:
>

This might be fixed by making `MDFile` (or `DIFile`) first-class. We
just need to canonicalize on creation.

    class MDFile {
    public:
      // Split the path at the right place.
      MDFile *get(LLVMContext &C, StringRef Path);

      // Convenience for callers, but the path gets canonicalized.
      MDFile *get(LLVMContext &C, StringRef File, StringRef Dir);

      StringRef getFilename() const;
      StringRef getDirectory() const;
    };

Note that whether we continue to use `MDString` under the hood is an
implementation detail.

However, path canonicalization (in particular, eating "..") requires
a `stat()` to do correctly on *NIX, so the implementation would have
to cache lookups. Doesn't seem hard though.

Seems mostly orthogonal to MDFile being first class

You're right.

- I imagine the logic would be implemented just as well in DIBuilder::createFile and moved around wherever it happens to go when MDFile/DIFile becomes first class. (& it'll have to be done on construction/during frontend work

Yup, has to be done when the `MDFile` is first constructed (and only
then).