Patch for main file ID and stat cache lookup issues when loading AST files

Attached are two patches (svn diff format) that correct two issues encountered when loading AST files generated by Clang with the '-emit-ast' option.

Problem 1: Loading an AST file via ASTUnit::LoadFromASTFile() sets ASTUnit::MainFileIsAST to true so that calls to ASTUnit::isMainFileAST() return true. However, SourceManager::MainFileID does not get restored. This is addressed by clang-main-file-id.patch

Problem 2: When the main source file is compiled using a relative path (ie, 'clang -emit-ast file.c'), filesystem stat lookups use the relative path exactly as passed on the command line. This results in the stat cache populated by ASTWriter via MemorizeStatCalls using the relative path as the key to the stat cache for the main file. However, the main source file name is also stored in the AST file using an absolute path as the "original file". When the AST file is later read by ASTReader, the original (absolute) path is used for filesystem stat lookup resulting in a cache miss for the stat cache because the absolute path doesn't match (string comparison) the relative path used as a key in the stat cache. In this case, the real filesystem is queried which may return stat data that does not reflect the stat values from the time of the compilation if the filesystem stat values have changed. This is addressed by clang-main-file-stat-cache.patch which simply converts input file names to absolute paths before passing them on to FrontendAction::BeginSourceFile().

A test program is attached to demonstrate these problems. When compiled and linked with an unpatched version of clang, the test program will abort on a failed assertion:
   Assertion `!source_manager.getMainFileID().isInvalid()' failed.

Applying clang-main-file-id.patch will correct the assertion failure. At this point, the second problem can be demonstrated with the following steps:

1) Compile an example .c file with 'clang -emit-ast file.c'. Make
    sure to use a relative path to the source file.
2) Run the test program against the .ast file generated in step 1.
    Output similar to the following should be displayed:
      main file: /path/to/file.c
          Size: 15
          Time: 1324402591: Tue Dec 20 12:36:31 2011
          Mode: 0100644
3) Change the mode of the .c file (ie, 'chmod g+w file.c'), and
    rerun the test program. If the stat cache is working properly,
    no changes should be seen. However, without the attached patch,
    a change to the mode is seen:
      main file: /path/to/file.c
          Size: 15
          Time: 1324402591: Tue Dec 20 12:36:31 2011
          Mode: 0100664

The reason the steps above change the file mode is because the AST reader will notice changes to the file size or modification stamp and abort the AST load if either fails to match. Changing the file mode avoids the load failure, but still demonstrates the cache miss.

I suspect that stat cache lookups will remain fragile as long as the stat cache continues to use non-normalized paths as keys. I haven't attempted to address this.

These issues were present in Clang 2.9 and 3.0 as well.

Tom.

clang-main-file-id.patch (1.7 KB)

clang-main-file-stat-cache.patch (1.21 KB)

testcase.cpp (2.15 KB)

Attached are two patches (svn diff format) that correct two issues
encountered when loading AST files generated by Clang with the
'-emit-ast' option.

Problem 1: Loading an AST file via ASTUnit::LoadFromASTFile() sets
ASTUnit::MainFileIsAST to true so that calls to
ASTUnit::isMainFileAST() return true. However,
SourceManager::MainFileID does not get restored. This is addressed
by clang-main-file-id.patch

Addressed in r147612.

Thank you!

Problem 2: When the main source file is compiled using a relative
path (ie, 'clang -emit-ast file.c'), filesystem stat lookups use
the relative path exactly as passed on the command line. This
results in the stat cache populated by ASTWriter via
MemorizeStatCalls using the relative path as the key to the stat
cache for the main file. However, the main source file name is
also stored in the AST file using an absolute path as the "original
file". When the AST file is later read by ASTReader, the original
(absolute) path is used for filesystem stat lookup resulting in a
cache miss for the stat cache because the absolute path doesn't
match (string comparison) the relative path used as a key in the
stat cache. In this case, the real filesystem is queried which may
return stat data that does not reflect the stat values from the
time of the compilation if the filesystem stat values have changed.
This is addressed by clang-main-file-stat-cache.patch which simply
converts input file names to absolute paths before passing them on
to FrontendAction::BeginSourceFile().

Ever since we made the AST reader to check all the files (as you
noticed) for robustness, the stored stat cache that the ASTWriter
populates became irrelevant, I'll probably remove it from the AST
file. This is not blocking you for something, right ?

The change to check all the files is not always desirable. It makes perfect sense when one wants to ensure a PCH file is up to date, but is not necessarily desirable for other uses of AST files. With Clang 2.9, one can generate an AST file, modify/delete all the source files used to generate the AST file, and still use the AST file for code generation, analysis, etc... later on. That no longer works with 3.0 without first disabling validation (sort of - deleted files don't fail the load, but modified files do). For those like myself using ASTUnit::LoadFromASTFile(), there are currently no options to disable validation without changing the source code.

For my purposes, the stat cache is important so that I can retrieve the file size, modification time, etc... used to create the AST file when analyzing the AST file on a different machine or after the source files have been modified or removed. In my case, missing or modified source files are more of a warning situation, not a hard error as I don't always need access to the original source.

Tom.

Attached are two patches (svn diff format) that correct two issues
encountered when loading AST files generated by Clang with the
'-emit-ast' option.

Problem 1: Loading an AST file via ASTUnit::LoadFromASTFile() sets
ASTUnit::MainFileIsAST to true so that calls to
ASTUnit::isMainFileAST() return true. However,
SourceManager::MainFileID does not get restored. This is addressed
by clang-main-file-id.patch

Addressed in r147612.

Thank you!

Problem 2: When the main source file is compiled using a relative
path (ie, 'clang -emit-ast file.c'), filesystem stat lookups use
the relative path exactly as passed on the command line. This
results in the stat cache populated by ASTWriter via
MemorizeStatCalls using the relative path as the key to the stat
cache for the main file. However, the main source file name is
also stored in the AST file using an absolute path as the "original
file". When the AST file is later read by ASTReader, the original
(absolute) path is used for filesystem stat lookup resulting in a
cache miss for the stat cache because the absolute path doesn't
match (string comparison) the relative path used as a key in the
stat cache. In this case, the real filesystem is queried which may
return stat data that does not reflect the stat values from the
time of the compilation if the filesystem stat values have changed.
This is addressed by clang-main-file-stat-cache.patch which simply
converts input file names to absolute paths before passing them on
to FrontendAction::BeginSourceFile().

Ever since we made the AST reader to check all the files (as you
noticed) for robustness, the stored stat cache that the ASTWriter
populates became irrelevant, I'll probably remove it from the AST
file. This is not blocking you for something, right ?

The change to check all the files is not always desirable. It makes
perfect sense when one wants to ensure a PCH file is up to date, but is
not necessarily desirable for other uses of AST files. With Clang 2.9,
one can generate an AST file, modify/delete all the source files used to
generate the AST file, and still use the AST file for code generation,
analysis, etc... later on.

That's not entirely true. Some operations use the source-location information to go back to the source code and get extra information, such as when one asks the Lexer/Preprocessor to change a token-based range into a character-based range. Or if one looks at the tokens in a macro, inspecting a literal token requires one to go back to the source code. Generating debug info also needs this information.

One could try to stay away from such operations, of course, but it's hard to be certain that you've done so.

That no longer works with 3.0 without first
disabling validation (sort of - deleted files don't fail the load, but
modified files do). For those like myself using
ASTUnit::LoadFromASTFile(), there are currently no options to disable
validation without changing the source code.

That's a fair point. Different clients may accept that the PCH file can't be used to extract complete information about the translation unit. A command-line compiler, for example, could probably delay checking the on-disk files until they are actually needed (if they are needed), trading some safety---we've never made recovery from those failures perfect---for some performance. When libclang is running in an IDE, however, we need that extra checking because we're more prone to crashing if the filesystem changes out from under us.

For my purposes, the stat cache is important so that I can retrieve the
file size, modification time, etc... used to create the AST file when
analyzing the AST file on a different machine or after the source files
have been modified or removed. In my case, missing or modified source
files are more of a warning situation, not a hard error as I don't
always need access to the original source.

Interesting. I guess I'd be fine with the stat cache coming back, and to allow more configurability with how strict the checking is when the AST file is initially loaded.

  - Doug