RFC: -E mode to preserve system includes

Background
The -E option runs the preprocessor, and emits a compilable file with all the usual preprocessor “stuff” all done. Include files have been included, macros expanded, conditionals selected, all that good stuff. It’s quite common to ask someone to provide a preprocessed input file in order to figure out where their problem is. The advantage of a preprocessed file is that it (ought to be) compilable by anyone, without any need to reconstruct the original user’s environment. As long as you have the same compiler version, and the right target, it should all Just Work.

Problem
These files can be huge. If the problem is a compiler crash, or even if it isn’t, pretty much the first step anyone takes is to reduce the test case down to something manageable. Even with automated tools, this can take time. I personally have had to do this for files in the multiple hundreds of thousands of lines.

Observation
A whole lot of those lines are from standard headers. #include <vector> alone produces almost 25k lines of -E output in my environment. The standard headers are basically never the source of a compiler crash, and yet we have to contend with reducing them over and over again.

Solution
A new option -fkeep-system-includes which modifies -E behavior so that it does not copy any system header files to the output, but instead propagates the #include directive. That is, if you had something like

// my_header.h
void my_code();
// my_code.cpp
#include "my_header.h"
#include <vector>
void my_code() { ... }

then -E -fkeep-system-includes would present you with

void my_code();
#include <vector>
void my_code() { ... }

instead of 25k lines of irrelevant STL code wrapped by the relevant bits.

I believe in most cases the result will still be compilable “anywhere” although obviously the more platform-dependent it gets the closer you have to be to the user’s environment. But IME most cases aren’t like that; and if it is, you can always fall back to unmodified -E.

Patch
To be posted, probably tomorrow. But it’s actually pretty simple, once you get past the overhead of adding a new option.

Credit
@gregb had the idea, and other people on our team thought it would be useful too. I thought I’d propose it upstream because it seems like it should be quite widely useful.

6 Likes

While I have no problem with this feature, and see how it could be useful, I find the interaction with the STL includes is often the cause of crashes (typically template instantiations of STL types), and that swapping versions can lose the crash. While I think there is value to this option, I’d hope it doesn’t become the ‘default’ we have our users/crash recorder do.

One alternative I might consider is having us augment -frewrite-includes to make it more trivial to ‘replace’ the individually named files with the ‘include’ (rather than the contents). This makes the creduce-type step do a simple delete in the case we don’t want it, but leaves it around anyway.

2 Likes

Thanks for suggesting this, I’d love to use that.

On the bikeshedding side of things, instead of a -f flag, should this be a new -d<Letter> thing?

1 Like

I find the interaction with the STL includes is often the cause of crashes (typically template instantiations of STL types), and that swapping versions can lose the crash. While I think there is value to this option, I’d hope it doesn’t become the ‘default’ we have our users/crash recorder do.

+1, it’s especially handy to have that STL code included when the crash is on a different target than the debugger’s host.

So I also think this new option seems like a good idea, but I don’t think it should become the default.

4 Likes

Thanks so much for picking this idea up @pogo59 !

So I also think this new option seems like a good idea, but I don’t think it should become the default.

Absolutely it shouldn’t become the default. A reproducer needs to be as perfectly reproducable as possible when run in a different environment. My use-case is almost entirely in cases where I’m trying to create a reduced testcase on my own machine. I do something like this manually when reducing testcases with a hodgepodge of python scripts that work 95% of the time (by scanning -E -H output to try to determine which headers to leave in place) but having the compiler do this for me would be utterly wonderful.

Thanks, I think this is a pretty neat idea!

I think this would be a -d<Letter> option rather than an -f option, but new letters for -d are reserved to GCC (Preprocessor Options (Using the GNU Compiler Collection (GCC))), so we should talk to them if we plan to pick a new d letter just to be sure we’re not stepping on toes.

Would this be a two-step process? e.g., -E writes out all the content, then a second pass rewrites the included content back to #include form? (I’m not terribly familiar with -frewrite-includes and we seem to have no documentation about the feature.)

1 Like

-frewrite-includes is pretty cool! It is an existing feature that does ‘less’ preprocessing, in that it leaves macros alone and only handles includes. It replaces:

#include <foo>

With:

#if 0
  #include <foo>
#else
...// contents of foo
#endif

I find it really useful for debugging preprocessor issues, and the -frewrite-modules equivalent does a great job for debugging modules issues.

So my idea here I THINK would be to replace that ‘0’ with: #if defined(_CLANG_INCL_FOO) || #defined(_CLANG_INCL_ALL)

or something. That way you could choose to include all? Perhaps we could be smarter and have a separate macro for STL vs not? Or <> vs "" includes?

Either way, its a half-baked idea at the moment :slight_smile:

Definitely would not make -fkeep-system-includes the default.

With respect, I think compiler crashes that came to the attention of library people are far more likely to involve the library headers than compiler crashes in general. I’ve been working with LLVM for a dozen years and I don’t think I’ve ever seen a library-header-related crash. I’ll also note that “system headers” encompasses a lot more than just the STL headers.

I did look at -frewrite-includes briefly but was deterred by the complete lack of documentation; kind of afraid to touch it, actually. InclusionRewriter.cpp appears to do a lot more than just add #if-guarded directives.

Re -d versus -f: I did look at the gcc docs while researching this. The -d options all look like they’re oriented towards debugging the preprocessor, and there are already various -f options that influence preprocessor behavior; that’s why I invented a -f option. The most closely related -d options is -dI which preserves the (unguarded) #include and emits the header content; I suspect the resulting file is not compilable at all.

The behavior of -frewrite-includes is not exactly what @erichkeane reported.

#include <foo>
int bar;

becomes

#if 0 /* expanded by -frewrite-includes */
#include <foo>
#endif /* expanded by -frewrite-includes */
...// contents of foo
int bar;

That is, currently there is no end-marker for the end of <foo> introduced by -frewrite-includes. Poking around inside InclusionRewriter.cpp for where the #if 0 is added, it’s in method CommentOutDirective, called from method Process. Now, Process is called recursively to handle the included content, so it seems potentially feasible to create an end-of-text marker.

Going this route doesn’t make the preprocessed file any smaller, but at least it would make it possible to identify the scope of each included range in a more reliable way than looking at #line markers. @gregb what do you think? I’ll have a go at doing it this way today.

Ah, thanks for the clarification, yeah, it seems like the version in my head is more useful.

I think that -frewrite-includes is incredibly useful, but yes, needs better documentation and could use some improvements. It IS pretty well documented in source however at least.

I presumed the ‘want a smaller file’ is more ‘less to pay attention to’ and not ‘size of the file on my disk’ sorta thing, so I figured a -frewrite-includes would be particularly useful/more easily prunable. It also has the benefit of keeping macros, though we rarely have preprocessor failures.

I’m definitely interested to see how your evaluation goes!

Note: I’m not a ‘library people’ at all, but I definitely see Clang crashes in library code/related to library code more often than not. That said, I specialize in template code, and the STL is very much template code.

Less to wade through while reducing a test case, is really the point. I recently had a case where I happened to have access to the full original source, not a preprocessed file, and being able to comment-out an include rather than try to progressively reduce the content of preprocessed included content was a big win.

2 Likes

What happens when your code uses a macro from a system header? Do you still evaluate it (which gives weird results if you then change the macro) or leave it unevaluated (which may work, but maybe stopping evaluation of a macro partway and evaluating it again from scratch does odd things in certain edge cases)?

What happens when your code defines macros that affect system headers (e.g. _GNU_SOURCE)? You need to leave those as defined so you get the right behaviour from headers, but then if you expand them like -E normally does that breaks things like:

#define foo(x) x
foo(foo)(123)

which would expand after one iteration to:

#define foo(x) x
foo(123)

but after two iterations to:

#define foo(x) x
123
1 Like

GCC has a short awk script called uninclude that takes preprocessed source and use the # line markers to identify included files, and if each file is present on disk, it “unincludes” it back to a #include directive.

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=contrib/uninclude;h=5612e655a985ca54d4aa95647440aa0eafe7d721;hb=refs/heads/master

This is much simpler than modifying the compiler itself.

2 Likes

The awk script would work fine IF your directory layout exactly matches the layout of whoever produced the preprocessed file. In my case, we deliver an SDK to licensees, and don’t have any control over where they put the SDK. Therefore, if I tried to awk my way back to #includes, it’s unlikely to work.

But, because I know the layout within the SDK is identical, if I already had #include directives, I’d be able to compile the test case and be 100% confident I was getting the same behavior, just by pointing my compiler at the same version of the SDK.

Edit: Changing -frewrite-includes is turning out to be relatively simple. If all I wanted to do was change the #endif to #else and add an #endif at the end of the included text, it would probably be about 10 lines. I’m getting fancy and adding the header file name to the comment on these directives, so 27 lines.

Well, that seems like a decent reason to modify -frewrite-includes instead of making it a variant of -E. The former doesn’t do a lot of preprocessor work, although it does do some. There’s no documentation so it’s hard to know exactly what’s done and what’s left alone.

I think the use case makes a lot of sense, and it’s reasonable to put it in the compiler rather than distributing a separate uninclude tool. Bundling debugging aids like this greatly simplifies handling user issues.

I think the concept of “system header” is incredibly nebulous. People use -isystem as a hack to suppress warnings in third party code that they don’t control. It really depends on what set of headers are shared between the user and the developer reproducing the problem. Maybe they have the same STL version, maybe they don’t, maybe it matters, maybe it doesn’t.

One potential way to address this would be to name the directories under which includes should not be expanded, so -E -Ipath/to/headers -fskip-includes=path/to/headers would retain includes from those search paths.

One particular set of headers that we rarely want in the crash reproducer is the intrinsic headers (immintrin.h & co). Those have an even more stable API than the C++ standard library, and they are version locked to the compiler, so skipping them when producing a crash reproducer would helpfully allow for more version drift between the user’s compiler version and the developer’s compiler version. This also means you can start to bisect a crash before you’ve reduced away the intrinsic headers. I don’t see a good reason why this couldn’t be the default crash reproducer behavior. This expands on your use case and strengthens the case for the feature.

So yeah, thanks for the idea, sounds cool. :slight_smile:

1 Like

IIUC the suggestion is to modify -frewrite-includes to make it easy by defining a macro to always take the #if containing the #include rather than the #else containing the text of the included file. I think some of the reasons given are compelling although it makes it a bit less useful out of the box for me, but still workable.

The main issue is that I won’t be able to use clang to preprocess those #else blocks out because it will bring in the text of the included file in the process. That should be easy enough to work around by implementing my own partial preprocessor in python for my use case.

I misspoke, the file doesn’t need to exist on disk. It matches a fixed set of prefixes for “standard headers” and replaces anything under those dirs with a #include for the header name. Even if it doesn’t exist.

So if you want to uninclude libc++ headers just match anything with include/c++/v1/ and replace anything like that. That will uninclude all STL headers.

I’ve modified -frewrite-includes so that

#include "foo.h"

becomes

#if 0 /* foo.h expanded by -frewrite-includes */
#include "foo.h"
#else /* foo.h expanded by -frewrite-includes */
...// content of foo.h
#endif /* foo.h expanded by -frewrite-includes */

which at least has the virtual of delimiting the extent of the included text. A handful of tests needed updating, but basically simple. I’m inclined to think this is worth having all by itself. People reducing test cases can flip a given #if 0 to #if 1 and if that works, hide or rip out the included content. I don’t know about anybody else’s editors, but Visual Studio will happily collapse (hide) a block of conditionalized source text.

Coming up with preprocessor symbols to control this by something like
#if defined(_CLANG_INCLUDE_FOO)
is fraught with peril, however, because the name “foo” is not guaranteed unique. Consider the #include_next directive, for starters. The only guaranteed way to uniquify the preprocessor names is to base them on absolute path names, and that’s really not going to be a pleasant experience.

I’ll try to get PRs up before I quit for the day, so people can see what I did and experiment for themselves.

+1

PRs are up:
Make -frewrite-includes put an endif at the end of the included text by pogo59 · Pull Request #67613 · llvm/llvm-project (github.com)
Add -fkeep-system-includes modifier for -E by pogo59 · Pull Request #67684 · llvm/llvm-project (github.com)
although I haven’t added any reviewers yet.

2 Likes