Module maps, __FILE__ and lazy stat'ing of headers

Hi Richard,

while working with module maps for layering checks a problem with FILE we noticed triggered a couple of questions.

The problem is that FILE uses the path of the first ‘stat’ call (as that is how FileManager::getFile() works).
Currently we parse all module maps at the beginning of the TU, and stat all headers while doing so; there is a FIXME in ModuleMapParser::parseHeaderDecl that reminds us this might not be a good idea, and that we’d want to somehow lazily stat the headers anyway.
Regardless, currently the first ‘stat’ to a header when using strict-decluse is usually the way the header is referenced by its module’s module map; this is often very different from what the user expects, and certainly different from how this used to work (using -I relative path via which a header is usually found).

There are multiple possible solutions, but the first one we thought about was to fix the header stat’ing FIXME, which would naturally solve this problem by pushing the first stat of the header after the stat from the #include.
The problem with that approach is:

  1. We currently have some places that look up a module for a header; we at least need to resolve the headers of all direct dependencies of the module map; I’m unsure what the effect of only finding the headers of the current module and its direct dependencies would be.
  2. We need to find a place in the code at which to resolve / stat all headers in the current module map (and its direct dependencies); this needs to be after preprocessing (thus we’d need to delay diags related to modules while parsing #include directives). It seems to me like this would allow us to also resolve headers only for the current module and its direct deps, at least if my theory is correct that preprocessing would happen once per file per module being built (even on recursive module build invocations)

After looking into it, it seems like this is significant effort, so we’re somewhat hoping to find an intermediate solution that is less effort and allows us to fix the above problems together with the work on full modules.

Thoughts?
/Manuel

+Ben

Hi Richard,

while working with module maps for layering checks a problem with __FILE__ we noticed triggered a couple of questions.

The problem is that __FILE__ uses the path of the first 'stat' call (as that is how FileManager::getFile() works).

I was thinking about this a while ago and independently of this issue I would really like to change this behaviour at some point. The name of a file is a property of how you look it up, not an intrinsic part of the file itself.

Currently we parse all module maps at the beginning of the TU, and stat all headers while doing so; there is a FIXME in ModuleMapParser::parseHeaderDecl that reminds us this might not be a good idea, and that we'd want to somehow lazily stat the headers anyway.
Regardless, currently the first 'stat' to a header when using strict-decluse is usually the way the header is referenced by its module's module map; this is often very different from what the user expects, and certainly different from how this used to work (using -I relative path via which a header is usually found).

Just to make sure I understand the issue: the file name is correct, it’s just showing up relative to the module map file rather than relative to an include path?

>
> Hi Richard,
>
> while working with module maps for layering checks a problem with
__FILE__ we noticed triggered a couple of questions.
>
> The problem is that __FILE__ uses the path of the first 'stat' call (as
that is how FileManager::getFile() works).

I was thinking about this a while ago and independently of this issue I
would really like to change this behaviour at some point. The name of a
file is a property of how you look it up, not an intrinsic part of the file
itself.

I agree. We have incorrectly conflated the notion of the file identity
(inode) with the directory entriy, and we can't tell the two apart. This
leads to weird behavior in a number of places. For instance:

a/
  x.h: #include "y.h"
  y.h: int a = 0;
b/
  x.h: symlink to a/x.h
  y.h: int b = 0;
main.c:
  #include "a/x.h"
  #include "b/x.h"
  int main() { return a + b; }

On a correct compiler, this would work. For clang, it fails, because
b/x.h's #include finds a/y.h, because we use the path by which we first
found x.h as the One True Path to x.h. (This also leads to wrong __FILE__,
etc.)

I tried fixing this ~2 years ago by splitting FileEntry into separate
dentry and inode classes, but this rapidly snowballed and exposed the same
design error being made in various other components. We should fix this,
and it seems the right way to address the problem here, but it's a big task.

Interesting. My motivation was keeping track of virtual and “real” paths for the VFS, which is a special case of the above. Maybe we can sink the dentry/inode down to the VFS layer eventually? Getting symlinks and “…” entries to work in the VFS make a lot more sense when we have explicit dentries rather than inferring from the file path.

I’d be interested in what issues you ran into here if you remember.

I’d be interested in taking a crack at this, but it’s probably not something I could focus on for a while. If someone else is eager, I’d still be happy with pitching in to review or doing VFS bits or whatever.

Ben

Hi Richard,

while working with module maps for layering checks a problem with FILE we noticed triggered a couple of questions.

The problem is that FILE uses the path of the first ‘stat’ call (as that is how FileManager::getFile() works).

I was thinking about this a while ago and independently of this issue I would really like to change this behaviour at some point. The name of a file is a property of how you look it up, not an intrinsic part of the file itself.

I agree. We have incorrectly conflated the notion of the file identity (inode) with the directory entriy, and we can’t tell the two apart. This leads to weird behavior in a number of places. For instance:

a/
x.h: #include “y.h”
y.h: int a = 0;
b/
x.h: symlink to a/x.h
y.h: int b = 0;
main.c:
#include “a/x.h”
#include “b/x.h”
int main() { return a + b; }

On a correct compiler, this would work. For clang, it fails, because b/x.h’s #include finds a/y.h, because we use the path by which we first found x.h as the One True Path to x.h. (This also leads to wrong FILE, etc.)

I tried fixing this ~2 years ago by splitting FileEntry into separate dentry and inode classes, but this rapidly snowballed and exposed the same design error being made in various other components.

Interesting. My motivation was keeping track of virtual and “real” paths for the VFS, which is a special case of the above. Maybe we can sink the dentry/inode down to the VFS layer eventually? Getting symlinks and “…” entries to work in the VFS make a lot more sense when we have explicit dentries rather than inferring from the file path.

I’d be interested in what issues you ran into here if you remember.

I’ll see if I can dig out my patch tomorrow.

>
>
>>
>>>
>>>
>>> >
>>> > Hi Richard,
>>> >
>>> > while working with module maps for layering checks a problem with
__FILE__ we noticed triggered a couple of questions.
>>> >
>>> > The problem is that __FILE__ uses the path of the first 'stat' call
(as that is how FileManager::getFile() works).
>>>
>>> I was thinking about this a while ago and independently of this issue
I would really like to change this behaviour at some point. The name of a
file is a property of how you look it up, not an intrinsic part of the file
itself.
>>
>>
>> I agree. We have incorrectly conflated the notion of the file identity
(inode) with the directory entriy, and we can't tell the two apart. This
leads to weird behavior in a number of places. For instance:
>>
>> a/
>> x.h: #include "y.h"
>> y.h: int a = 0;
>> b/
>> x.h: symlink to a/x.h
>> y.h: int b = 0;
>> main.c:
>> #include "a/x.h"
>> #include "b/x.h"
>> int main() { return a + b; }
>>
>> On a correct compiler, this would work. For clang, it fails, because
b/x.h's #include finds a/y.h, because we use the path by which we first
found x.h as the One True Path to x.h. (This also leads to wrong __FILE__,
etc.)
>>
>> I tried fixing this ~2 years ago by splitting FileEntry into separate
dentry and inode classes, but this rapidly snowballed and exposed the same
design error being made in various other components.
>
>
> Interesting. My motivation was keeping track of virtual and “real”
paths for the VFS, which is a special case of the above. Maybe we can sink
the dentry/inode down to the VFS layer eventually? Getting symlinks and
“..” entries to work in the VFS make a lot more sense when we have explicit
dentries rather than inferring from the file path.
>
> I’d be interested in what issues you ran into here if you remember.

I'll see if I can dig out my patch tomorrow.

Another question is whether we can think of a workaround in the meantime...

Attached is my WIP patch from (as it turns out) over 2 years ago. My (very)
hazy memories were that we had quite a few different places where people
were using FileEntry*s for things, and neither a dentry nor an inode seemed
like the "right" thing. I don't remember any more details than that. There
were also places where we would need to just make a decision, such as: what
should #pragma once use as its key? (I think dentry is the right answer
here, since the same file found in different directories might mean
different things, but that answer may break some people who use #pragma
once and don't also have include guards. Conversely, it fixes some builds
on content-addressed file systems.)

clang-6.diff (89.8 KB)

As this is blocking us, I’m taking a stab at it…

Richard, as this patch doesn’t apply cleanly any more - can you outline what you were trying to do in SourceManager.h? I have a hard time figuring out which places that use FileEntry now need to use FileName…

Cheers,
/Manuel

As this is blocking us, I’m taking a stab at it…

One thing that stood out to me in Richard’s patch:

  • Returning FileName pointers invites users to rely on the identity of the FileName object, which won’t work on case-insensitive file systems and is probably a bad idea in general. I suggest we wrap FileName * in a “FileRef” object that prevents comparisons or possibly forwards them to comparing the inodes.

Ben

As this is blocking us, I'm taking a stab at it…

One thing that stood out to me in Richard's patch:

* Returning FileName pointers invites users to rely on the identity of the
FileName object, which won’t work on case-insensitive file systems and is
probably a bad idea in general. I suggest we wrap FileName * in a
“FileRef” object that prevents comparisons or possibly forwards them to
comparing the inodes.

I don't agree: there are some uses where we really want to compare the
canonicalized FileName pointers. For instance, this is the right behavior
for #pragma once, and for various things in HeaderSearch. (When looking up
a header relative to a file, it should be the FileName that is the cache
key, not the FileEntry.)

Perhaps the best way to deal with case-insensitive file systems is to give
paths that differ by case the same FileName object, since they are the same
dentry (which is the notion we're trying to capture here). Likewise, on
Windows, the short and long forms of the same file name should have the
same FileName object. We may be able to make that change without
introducing a split between FileName and FileEntry; the downside would be
that we may potentially load the same inode multiple times if it's found by
different paths (relative paths being the most likely offender).

Ben

SGTM.

Ok, I might be dense here, but I’m still missing the picture :slight_smile:

So, I see that the idea is to have FileName that basically encapsulates which path/name combination we found the file under, and a FileEntry, which maps to the inode.

Questions:

  1. do we want to replace all uses of FileEntry with FileName, or are there cases where we’re only interested in the inodes (for example in SourceManager).
  2. how would we detect whether we are on a case insensitive file system?

Cheers,
/Manuel

Ok, I might be dense here, but I’m still missing the picture :slight_smile:

So, I see that the idea is to have FileName that basically encapsulates which path/name combination we found the file under, and a FileEntry, which maps to the inode.

Questions:

  1. do we want to replace all uses of FileEntry with FileName, or are there cases where we’re only interested in the inodes (for example in SourceManager).
  1. how would we detect whether we are on a case insensitive file system?

We can check sensitivity lazily on a per-file basis. If a lookup misses the SeenFileEntries cache, but hits a file we already have a UniqueID for, we can then check whether we already have a FileName for that file that is the same as our lookup up to case-sensitivity. To do that, we can either store a list of FileName pointers in the FileEntry, or we can add another map to the FileManager to look it up. Either way, the common-case will be one FIleName per FileEntry

Ben

Richard came up with the idea of updating the FileEntry’s name on each getFile, so we’d get the last value instead of the first (which would solve the module map related problem).
Unfortunately this breaks:
http://llvm.org/bugs/show_bug.cgi?id=8974

Richard came up with the idea of updating the FileEntry's name on each
getFile, so we'd get the last value instead of the first (which would solve
the module map related problem).
Unfortunately this breaks:
http://llvm.org/bugs/show_bug.cgi?id=8974

I've got what I think is a "real fix" for the regression. With that fix,
just resetting .Name on each stat works. Will send out code review for the
regression fix in a moment.