Extending Clang's serialized diagnostics with source file contents

Hi all,

I’d like to extend Clang’s serialized-diagnostics binary format with the ability to provide the contents of source files that are referenced by diagnostics. This can have a few uses:

  • Make serialized diagnostics files fully self-contained, so you don’t need to have the original source files around to reproduce caret diagnostics. This is particularly good if source files are generated as part of the build and might be hard to get access to in some build environments.
  • Allow source code that the compiler generates internally (e.g., macro expansions) to be emitted into the serialized diagnostics files, so we can reproduce caret diagnostics for it and let users inspect the code after the fact.

The extension I’m proposing is to add a single new record kind, RECORD_SOURCE_FILE_CONTENTS, to the serialized diagnostics bitstream format. This record contains:

  • The ID of the file for which it is providing contents (i.e., the same ID that will occur in a RECORD_FILENAME).
  • The “original source range” for this buffer, which is where this source file logically resides in another source file. It’s somewhat tied to the macro-expansion use case, where it indicates the place that the macro was expanded, but it can be left as an empty range for other use cases.
  • A blob containing the actual source text.
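
To make the three fields concrete, here is a minimal in-memory sketch of what a reader might keep per record. The type and function names (DemoLoc, DemoRange, DemoSourceFileContents, demo_find_contents) and the "file ID 0 means no file" convention are assumptions made up for this illustration; they are not the actual bitstream layout or the types used in the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical location: a file ID plus a line/column pair.
   Assumption for this sketch: file_id 0 means "no file". */
typedef struct {
    unsigned file_id;
    unsigned line;
    unsigned column;
} DemoLoc;

typedef struct {
    DemoLoc begin;
    DemoLoc end;
} DemoRange;

/* Hypothetical in-memory form of one RECORD_SOURCE_FILE_CONTENTS record. */
typedef struct {
    unsigned file_id;          /* same ID that occurs in RECORD_FILENAME */
    DemoRange original_range;  /* left empty (all zeros) when not applicable */
    const char *text;          /* blob; not assumed to be NUL-terminated */
    size_t text_size;          /* size of the blob in bytes */
} DemoSourceFileContents;

/* An empty range refers to no file at either end. */
static int demo_range_is_empty(const DemoRange *range) {
    return range->begin.file_id == 0 && range->end.file_id == 0;
}

/* Linear lookup of the contents record for a file ID; NULL if absent. */
static const DemoSourceFileContents *
demo_find_contents(const DemoSourceFileContents *records, size_t count,
                   unsigned file_id) {
    assert(records != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (records[i].file_id == file_id)
            return &records[i];
    }
    return NULL;
}
```

A reader that loads the records into such an array can look up a macro-expansion buffer by the file ID a diagnostic refers to, then use the original range (when it is not empty) to point back at the expansion site.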

Serialized diagnostic files are generally read by libclang’s clang_loadDiagnostics, although other implementations exist. We can extend libclang with two additional APIs to get those bits of information from a loaded diagnostic set:

/**
  * Get the contents of the given file that was provided via diagnostics.
  *
  * \param diags the diagnostics set to query for the contents of the file.
  * \param file the file to get the contents of.
  * \param outFileSize if non-null, set to the file size on success.
  * \returns on success, a pointer to the file contents. Otherwise, NULL.
  */
 CINDEX_LINKAGE const char *clang_getDiagnosticFileContents(
     CXDiagnosticSet diags, CXFile file, size_t *outFileSize);

 /**
  * Retrieve the original source range if the given file was provided via
  * diagnostics and is conceptually a replacement for the original source range.
  *
  * \param diags the diagnostics set to query for the original source range.
  * \param file the file whose original source range is requested.
  * \returns on success, the source range (into another file) that is
  * conceptually replaced by the contents of the given file (available via
  * \c clang_getDiagnosticFileContents).
  */
 CINDEX_LINKAGE CXSourceRange clang_getDiagnosticFileOriginalSourceRange(
     CXDiagnosticSet diags, CXFile file);

The serialized diagnostic format has been stable for nearly a decade. Fortunately, libclang ignores any records it does not know about, so we can add this new record kind without breaking existing implementations and without a version bump. Old clients will see filenames in diagnostics that they can’t find on disk, but that’s no worse than what we have today when the source is missing or has moved. Once clients are updated, they’ll get the source file contents.
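
The compatibility argument can be illustrated with a toy reader. This is only a sketch of the "skip what you don't recognize" behavior, not libclang's actual reader: the flat record array, the DEMO_ record kinds and their numeric values, and the knows_contents switch are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up record kinds for this sketch; the values are NOT the real
   serialized-diagnostics record IDs. */
enum {
    DEMO_RECORD_DIAG = 1,
    DEMO_RECORD_FILENAME = 2,
    DEMO_RECORD_SOURCE_FILE_CONTENTS = 3
};

typedef struct {
    unsigned kind;
    const char *payload;
} DemoRecord;

typedef struct {
    size_t diags;
    size_t filenames;
    size_t contents;
    size_t skipped;
} DemoCounts;

/* Walk the records. A reader with knows_contents == 0 models an old client:
   it has no handler for the new kind, so it skips the record and keeps
   reading instead of failing. */
static DemoCounts demo_read_records(const DemoRecord *records, size_t count,
                                    int knows_contents) {
    DemoCounts counts = {0, 0, 0, 0};
    assert(records != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        switch (records[i].kind) {
        case DEMO_RECORD_DIAG:
            ++counts.diags;
            break;
        case DEMO_RECORD_FILENAME:
            ++counts.filenames;
            break;
        case DEMO_RECORD_SOURCE_FILE_CONTENTS:
            if (knows_contents)
                ++counts.contents;
            else
                ++counts.skipped;
            break;
        default:
            ++counts.skipped; /* any other unknown kind is ignored too */
            break;
        }
    }
    return counts;
}
```

Fed the same three-record stream (one diagnostic, one filename, one contents record), the old reader still sees the diagnostic and the filename and skips one record, while the updated reader picks up the contents as well.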

I have an implementation of this change in this pull request, which pairs with a Swift compiler change to emit macro expansion buffer contents using this mechanism.

Thoughts? Additional use cases?

Doug


Sounds good to me!

Hi Doug,

This sounds awesome. I wonder how this relates to -fmodules-embed-all-files, which embeds the header contents into the AST file.

Best, Vassil

We could do the equivalent of that option for serialized diagnostics, if we wanted to. That’s my full “make serialized diagnostics files fully self-contained” use case, which is what -fmodules-embed-all-files does for serialized ASTs.

Do we actually want that functionality? I don’t know, but it’s pretty easy to build once the serialized-diagnostics change I mentioned is in place.

Doug

This sounds sensible to me, thank you!

Nice, I’ve seen this causing a bunch of unnecessary churn in the past - this is definitely a good addition to have.

Out of curiosity, what kind of policy does your client use on top of this? That is, will it always prefer to consume the serialized content, or do you plan to tier this with out-of-date timestamp checking, hashing, etc.?

My particular client involves macro expansion, where I want to retain the source code that results from a macro expansion so that I can, for example, show diagnostics if there is a type-checker error. If we were to add some kind of Clang flag to embed sources in this way, I would probably always have them prefer the serialized content because we know it exactly matches how the program was built.

Doug


I was mostly wondering whether we could reuse the blob if it was already there in the case of -fmodules-embed-all-files. I was not sure if you knew about this obscure but useful flag. I am happy with either approach.