case-insensitive #include warning

Hi all,

I have an initial cut at patch that issues a warning when a file is #included on a case-insensitive file system that would not be found on a case-sensitive system. Is there interest?

Since this is a hard problem to solve perfectly, I have opted for a strategy that is efficient and conservative about issuing diagnostics. That is, it shouldn’t give false positives, but it will fail to diagnose some non-portable paths. On *nix systems, the low-level APIs that stat and open files get an extra call to ::realpath, and the result is cached along with the rest of the file metadata. On Windows, I use a combination of GetFullPathName and GetLongPathName to get the same effect. (I don’t believe that’s guaranteed to get the physical name including case, but it seems to mostly work in my testing.)

Due to how I compare path components, a relative path like “NoTtHeRiGhTcAsE/…/correctly-cased.h” will not be diagnosed, but “…/NoTtHeRiGhTcAsE/correctly-cased.h” will be. Catching more cases requires many more round trips to the disk, which I wanted to avoid.

Feedback welcome.

Thanks,
Eric

clang.patch (25.4 KB)

llvm.patch (2.7 KB)

Forgot the tests in the patch. Sorry, new patch attached.

llvm.patch (2.7 KB)

clang.patch (27.8 KB)

Hi Eric,

This would be a hugely welcomed feature, but have you done any performance analysis of this? The preprocessor and the data structures you are touching are very sensitive.

You can stress test the preprocessor by using the "clang -cc1 -Eonly” mode. If you’re on a mac, I’d recommend timing <Cocoa/Cocoa.h>

-Chris

Out of curiosity have you tried this with some of the more interesting upper/lower case pairs like the turkish ‘İ’?

It sounds like the way you’re achieving this should allow this to work, but its worthwhile to try it.

Reposting this to the list (replied to sender by mistake):

Jonathan Coe: I wonder if a clang-tidy check might be a good place for a more exhaustive check?

Eric N: That’s not a bad idea, thanks. I would still like to leave this simple check in to catch the majority of the cases, since not everybody uses clang-tidy. Do you have thoughts on the patch?

I can say that this almost certainly does not work for non-ascii since it just uses StringRef::equals_lower. Is there any proper locale-sensitive string comparison routines in llvm that I can use?

Eric

A safer way is to use an OS API that gives you back the exact name of the file. On windows GetFinalPathNameByHandle should work, I don’t know mac well enough to suggest an API there. At that point you could do a case sensitive string compare on the name and warn if there are differences.

This also prevents all the problems of locale specifics since tolower means different things in different locales. So your warning could work fine in en-us but not in some other country. Anything relying on tolower/toupper is guaranteed to have internationalization bugs.

John,

The strategy you describe is basically the strategy I’m using. I get the actual name of the file on disk when opening it for the first type and stuff the result in the File cache. Then when processing the #include, I check the components of the user’s #include path with the actual path on disk. If any path component differs from the corresponding component in the “real” path, the warning is triggered – unless they differ by more than case. In that case, I assume it’s a simlink and conservatively don’t issue the diagnostic. That last bit is the only place where I use equals_lower. This would yield a false positives only when (a) two path components are different according to StringRef::equal and (b) they are the same according to StringRef::equal_lower. That /seems/ unlikely to me, but I’m not positive. It seems more likely that this will fail to diagnose some incorrectly cased paths. Regardless, a locale-sensitive string compare routine sure would be a nice thing to have here.

A #include like <NoTtHeRiGhTcAsE/…/correctly-cased.h> is not flagged since NoTtHeRiGhTcAsE would probably not show up in the “real” path to correctly-cased.h as reported by the OS. So I have nothing to compare NoTtHeRiGhTcAsE against, and the error is not diagnosed.

Re GetFinalPathNameByHandle, that’s a new(er) Windows API. Vista and 2008 Server only. I was reluctant to use it since I don’t know if LLVM targets older OSes. I’m happy to be wrong about that.

Eric

A safer way is to use an OS API that gives you back the exact name of the
file. On windows GetFinalPathNameByHandle should work, I don't know mac
well enough to suggest an API there. At that point you could do a case
sensitive string compare on the name and warn if there are differences.

This also prevents all the problems of locale specifics since tolower means
different things in different locales. So your warning could work fine in
en-us but not in some other country. Anything relying on tolower/toupper is
guaranteed to have internationalization bugs.

Internationalization isn't actually a problem here. The issue is that
some file systems/OSs change the paths you give them and some don't.
If the path you get from directory iteration is memcmp different from
the path used to #include the file, then it isn't going to work on a
"normal" Linux setup. An additional wrinkle is added because this
implementation is also resolving symlinks, and so has to handle that
case also.

- Michael Spencer

John,

The strategy you describe is basically the strategy I'm using. I get the
actual name of the file on disk when opening it for the first type and stuff
the result in the File cache. Then when processing the #include, I check the
components of the user's #include path with the actual path on disk. If any
path component differs from the corresponding component in the "real" path,
the warning is triggered -- unless they differ by more than case. In that
case, I assume it's a simlink and conservatively don't issue the diagnostic.
That last bit is the only place where I use equals_lower. This would yield a
false positives only when (a) two path components are different according to
StringRef::equal and (b) they are the same according to
StringRef::equal_lower. That /seems/ unlikely to me, but I'm not positive.
It seems more likely that this will fail to diagnose some incorrectly cased
paths. Regardless, a locale-sensitive string compare routine sure would be a
nice thing to have here.

A #include like <NoTtHeRiGhTcAsE/../correctly-cased.h> is not flagged since
NoTtHeRiGhTcAsE would probably not show up in the "real" path to
correctly-cased.h as reported by the OS. So I have nothing to compare
NoTtHeRiGhTcAsE against, and the error is not diagnosed.

Re GetFinalPathNameByHandle, that's a new(er) Windows API. Vista and 2008
Server only. I was reluctant to use it since I don't know if LLVM targets
older OSes. I'm happy to be wrong about that.

Eric

LLVM has dropped XP support for running the compiler (not for targeting).

- Michael Spencer

It sounds like you’ve hit >99% accuracy which should be good enough for this feature in my opinion. GetProcAddress will let you know at runtime if the API is available or not, assuming this method is faster than older equivalents.

I just wanted to force a discussion on possible internationalization issues.

Thanks for the suggestion, Chris. Looks like I’ve regressed the speed of the preprocessor by 25%. Yikes. I’ll see what I can do.

Eric

I was working on a similar patch a few years ago (http://lists.llvm.org/pipermail/cfe-dev/2012-July/023090.html is the start of a very long thread). Does realpath really give the real case on unix? My understanding is that it doesn’t. I had to resort to reading the directory listings and comparing entries. And of course readdir is even slower. I could only ever get decent performance by aggressively caching the directory listings. In the end it got too complicated and wasn’t entirely convinced that performance would ever be good enough to warrant the addition. I was mostly interested in Mac though. Maybe it’s a different story on Windows.

Jason

I haven’t been able to successfully mount a case-insensitive file system on my Linux VM, so I haven’t specifically tested this. On Mac, realpath gets the case right, but it’s not needed since it has fcntl/F_GETPATH, which I’ve discovered is faster. In my tests, realpath does not get the right case on Cygwin. On *nixes that have /proc, it’s possible to readlink on /proc/self/fd/$FD once the file is opened, and that should give the properly cased filename. I haven’t done extensive benchmarking, but that is certain to be faster than recursive directory traversal. But not all *nixes have /proc.

TL;DR this is going to be an unreliable warning. Personally, I think that’s fine so long as it doesn’t give false positives and doesn’t slow things down too badly. I’m busy reworking the patch to that end.

\e

On 4/7/16, 11:37 AM, "clattner@apple.com on behalf of Chris Lattner" <

Hi all,

I have an initial cut at patch that issues a warning when a file is
#included on a case-insensitive file system that would not be found on a
case-sensitive system. Is there interest?

Since this is a hard problem to solve perfectly, I have opted for a
strategy that is efficient and conservative about issuing diagnostics. That
is, it shouldn't give false positives, but it will fail to diagnose some
non-portable paths. On *nix systems, the low-level APIs that stat and open
files get an extra call to ::realpath, and the result is cached along with
the rest of the file metadata. […]

I was working on a similar patch a few years ago (
http://lists.llvm.org/pipermail/cfe-dev/2012-July/023090.html
<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_pipermail_cfe-2Ddev_2012-2DJuly_023090.html&d=CwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=t4UJXwtMcIwGvgOaMqGzrQ&m=c8D9-z1QyfRWobwgBlnojB0yAu4P3Ve2-Wk9Z4vAxjg&s=5yrhYcvssUCiHzENOa2jFtUbhJNW1jTrGLKAnlJLkuQ&e=&gt; is
the start of a very long thread). Does realpath really give the real case
on unix? My understanding is that it doesn’t. I had to resort to reading
the directory listings and comparing entries. And of course readdir is even
slower. I could only ever get decent performance by aggressively caching
the directory listings. In the end it got too complicated and wasn’t
entirely convinced that performance would ever be good enough to warrant
the addition. I was mostly interested in Mac though. Maybe it’s a different
story on Windows.

I haven't been able to successfully mount a case-insensitive file system
on my Linux VM,

This should do it:

  fallocate -l 3g mydisk
  mkfs.vfat mydisk
  mkdir mountdir
  sudo mount mydisk mountdir

I have an initial cut at patch that issues a warning when a file is #included on a case-insensitive file system that would not be found on a case-sensitive system. Is there interest?

Since this is a hard problem to solve perfectly, I have opted for a strategy that is efficient and conservative about issuing diagnostics. That is, it shouldn't give false positives, but it will fail to diagnose some non-portable paths. On *nix
systems, the low-level APIs that stat and open files get an extra call to ::realpath, and the result is cached along with the rest of the file metadata. […]

I was working on a similar patch a few years ago (http://lists.llvm.org/pipermail/cfe-dev/2012-July/023090.html is
the start of a very long thread). Does realpath really give the real case on unix? My understanding is that it doesn’t. [...]

I haven't been able to successfully mount a case-insensitive file system on my Linux VM, so I haven't specifically tested this. On Mac, realpath gets the case right, but it's not needed since it has fcntl/F_GETPATH, which I've discovered is faster. In
my tests, realpath does not get the right case on Cygwin.

And now that I've (finally!) been able to test this on Linux, I can confirm what Jason said: on Linux, realpath does *not* give the correct case. However, most *nixes besides MacOS have /proc, and ...

On *nixes that have /proc, it's possible to readlink on /proc/self/fd/$FD once the file is opened, and that
should give the properly cased filename.

^^^^ I have also confirmed this. ^^^^ :slight_smile:

I have reimplemented this feature with an eye to performance. I now have it down to 2-3% perf regression in the preprocessor, which I hope is acceptable. I have benchmarked perf on MacOS (with Cocoa.h), Linux (with all of Boost.Spirit), and Windows (with Windows.h). For the record, Windows.h is terrible about properly casing its #includes. :slight_smile:

Incorrectly cased #includes are diagnosed on MacOS, Windows, and on Linux (for case-insensitive file systems). They are even diagnosed in Cygwin. They will not be diagnosed on *nix-like systems that don't have either F_GETPATH (like MacOS) or a mounted /proc filesystem.

Comments welcome.

Eric

P.S. I'm still working on getting clang's tests to run on Windows, so this hasn't been tested there.

llvm.patch (4.11 KB)

clang.patch (52.8 KB)

Any thoughts about this patch? Are folks here still interested? (Sending updated patch with --ignore-space-change.)

\e

clang.patch (15.9 KB)

llvm.patch (4.11 KB)

Hi Eric,

I measured it locally and it doesn't seem to introduce any significant
compile time slowdown. Ideally this should be guarded by a flag so
that we can avoid any potential compile time issue; although there's a
flag declaration in the patch, I don't see it getting used to guard
the warning.

The patch is definitely interesting. I would like to measure it once
more after it gets properly reviewed though.
Can you upload your patches to http://reviews.llvm.org? Please add me
as a reviewer.

Please leave the trailing space / cosmetic fixes out of the patch, it
becomes much easier to review.

Thanks,

Any thoughts about this patch? Are folks here still interested? (Sending updated patch with --ignore-space-change.)

(Relaying second-hand information here). On OS X, I think realpath and
F_GETPATH actually return something from an xnu cache which may or may
not match the case of the underlying file depending on recent
operations.

Apparently for HFS filesystems, apparently fgetattrlist(ATTR_CMN_NAME)
will work, but even that's not guaranteed on all case-insensitive
filesystems (yet?).

Cheers.

Tim.