Feature proposal: Compile Configuration Disclosure

Hi all,

I'm interested in implementing a feature in Clang. The basic idea is that, given a new command-line switch, like perhaps:
  --disclose-config
... Clang would output, to a plain text file (which I'll call a "disclosure file"), information about the compilation environment sufficient for a third-party program to observe the same sequence of tokens that Clang saw.

So basically, providing this dump enables one to perfectly emulate not just the compiler but a *specific run* of the compiler. Note that tools based on Clang libraries could make use of information in disclosure files.

If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.

If people see this as a desirable feature and reasonable to implement in Clang, then the next few questions are naturally raised:

    1. What information would be output?
    2. Where would it be output?
    3. What format would be used in the output?

1) What information would be output?

Hi all,

I'm interested in implementing a feature in Clang. The basic idea is that, given a new command-line switch, like perhaps:
--disclose-config
... Clang would output, to a plain text file (which I'll call a "disclosure file"), information about the compilation environment sufficient for a third-party program to observe the same sequence of tokens that Clang saw.

So basically, providing this dump enables one to perfectly emulate not just the compiler but a *specific run* of the compiler. Note that tools based on Clang libraries could make use of information in disclosure files.

I assume we would also need the reverse operation, taking a disclosure file and setting Clang's internal options to match the configuration described?

If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.

If successful, one should be able to configure compiler A to build things like compiler B, just by dumping the disclosure file from one and importing it into the other, no?

If people see this as a desirable feature and reasonable to implement in Clang,

Seems reasonable to me.

then the next few questions are naturally raised:

   1. What information would be output?
   2. Where would it be output?
   3. What format would be used in the output?

1) What information would be output?

At minimum, a tool *needs* the following information for each translation unit:

List of Needs:
--------------

   - The complete set of predefined macro definitions;

   - The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;

   - The ordered sequence of implicit #include directives;

The ordering between predefined macro defs and implicit includes (e.g., based on -include on the command line) can matter.

   - The path name of the primary source file;

   - The working directory;

   - Environment variables;

How would these affect translation ???

   - The assumed encoding of the source file and the internal encoding used during processing;

   - The name & version of the compiler;

   - argv; and

   - sizes of primitive types (except in the case where the main output is preprocessor output).

So those are all of the "needs" as I see them.
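
To make the list of needs concrete, here is a minimal sketch of one per-translation-unit record. Every field name is hypothetical, and JSON is used only because it round-trips easily; nothing in this thread has settled on a schema or a format.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Disclosure:
    """One record per translation unit. All field names are assumptions."""

    compiler_name: str
    compiler_version: str
    argv: list[str]
    working_directory: str
    primary_source_file: str
    # Predefined macro definitions and implicit #include directives,
    # kept in the order the preprocessor would see them.
    implicit_directives: list[str] = field(default_factory=list)
    # Header search directories, in lookup order.
    quote_include_dirs: list[str] = field(default_factory=list)
    angle_include_dirs: list[str] = field(default_factory=list)
    environment: dict[str, str] = field(default_factory=dict)
    source_encoding: str = "UTF-8"
    internal_encoding: str = "UTF-8"
    # For example {"int": 4}; may stay empty for preprocess-only runs.
    type_sizes: dict[str, int] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the record as stable, human-readable JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Disclosure":
        """Rebuild a record from the text produced by to_json()."""
        return cls(**json.loads(text))
```

A third-party tool would read such a record back with Disclosure.from_json() and configure its own preprocessor from the fields.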

Here are some "wants":

It would be *very* helpful to automatically determine, for a given program image (or a dynamic library or a static archive), the full set of disclosed configurations (as in the "List of Needs" above), because then every tool would be able to completely configure itself for a given project with minimal involvement from the user.

*That* is going to be tricky, because it's a lot of information to encode in each object file/library/executable. Better to start with the tools dumping/loading their configurations.

If each tool in the tool chain, when given the --disclose-config option, also sent info about its inputs and outputs to a text file, then this could be achieved.

Sure.

So for example, the linker would need only respond to "--disclose-config" by dumping (again, to a secondary file) its environment and the names of input files & output files. Renames could be detected and accounted for if SHA-1 values were passed along the way.

Other "wants" include:

   - dialect information for the TU

   - brief info on language extensions

   - SHA of the preprocessing token sequence

Keeping an unbroken chain of SHA values (a la git) could have other uses, like verifying that a specific build has been reproduced exactly (not counting certain things like different expansions of __DATE__ and __TIME__). This could be very handy when migrating between build systems or when a user is expected to deliver source code along with a program image.

This feels like you're moving toward a full model of the steps it takes to compile a "project" (as an IDE or make system would know about).

2) Where would the configuration disclosure information be output?

Personally, I don't care, and I'd be happy to try to follow any request for which there is no obvious contraindication. But earlier, Doug suggested (in a semi-private email) that each config-disclosure file should be placed beside its associated output file, and that the name simply be something like ${MAIN_OUTPUT_FILE}.config-disclosure. E.g. for this invocation:

  clang --disclose-config -c x.cpp -o foo.o

...the disclosure file would be placed in
   foo.o.config-disclosure
in the same directory as foo.o (which in this case happens to be the working directory). (By the way, if anyone is not happy about that filename extension, I'm totally happy to change it.)
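
As a rough illustration of that naming rule, the sketch below appends the suffix to the main output file's name. The helper names are made up, and only an explicit, separate "-o FILE" argument is handled; default output names are left out on purpose.

```python
DISCLOSURE_SUFFIX = ".config-disclosure"


def disclosure_path(main_output_file):
    """Append the suffix so the file lands beside its output file."""
    return main_output_file + DISCLOSURE_SUFFIX


def disclosure_path_from_argv(argv):
    """Return the disclosure path for an explicit '-o FILE', else None."""
    for i, arg in enumerate(argv[:-1]):
        if arg == "-o":
            return disclosure_path(argv[i + 1])
    return None
```

Because the suffix is appended to the full output path, the disclosure file automatically ends up in the same directory as the output file.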

3) What format would be used in the output?

I'm guessing there would be a little flux about this for a while, so XML seems like a natural and universally-grokked choice. But if people prefer other formats like JSON I'd look at that too. The KISS principle should probably apply though.

I don't have a strong opinion on these details.

  - Doug

I assume we would also need the reverse operation, taking a disclosure file and setting Clang's internal options to match the configuration described?

My motives were so self-serving that this did not occur to me. (:

I assumed that the driver of a 3rd-party tool would have its own disclosure-file-parser and set things up appropriately.

But for completeness, direct support for it in Clang would seem to make sense.

If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.

If successful, one should be able to configure compiler A to build things like compiler B, just by dumping the disclosure file from one and importing it into the other, no?

Right.

If people see this as a desirable feature and reasonable to implement in Clang,

Seems reasonable to me.

Glad you think so!

I'll ask about implementation details in a separate thread.

then the next few questions are naturally raised:

  1. What information would be output?
  2. Where would it be output?
  3. What format would be used in the output?

1) What information would be output?

At minimum, a tool *needs* the following information for each translation unit:

List of Needs:
--------------

  - The complete set of predefined macro definitions;

  - The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;

  - The ordered sequence of implicit #include directives;

The ordering between predefined macro defs and implicit includes (e.g., based on -include on the command line) can matter.

Important tip; thanks!

  - The path name of the primary source file;

  - The working directory;

  - Environment variables;

How would these affect translation ???

Well, some compilers depend on environment variables. E.g. MSVC uses the variable INCLUDE, which names directories containing headers that ship with the compiler. And GCC uses several environment variables; see the section 'ENVIRONMENT' in the GCC manual.

Granted, INCLUDE would already be covered in the second bullet point above. But it's always possible that the environment will contain something that alters a compiler's behavior in some unexpected way.

The whole point of this proposal is to bring to light essential (or potentially essential) details that have traditionally caused grief as a result of being hidden. So if a compiler relies on the environment then the environment is just another component of the configuration; therefore it's necessary in order to reproduce the compiler's behavior (and therefore appropriate to reveal in a config-disclosure file).

  - The assumed encoding of the source file and the internal encoding used during processing;

  - The name & version of the compiler;

  - argv; and

  - sizes of primitive types (except in the case where the main output is preprocessor output).

So those are all of the "needs" as I see them.

Here are some "wants":

It would be *very* helpful to automatically determine, for a given program image (or a dynamic library or a static archive), the full set of disclosed configurations (as in the "List of Needs" above), because then every tool would be able to completely configure itself for a given project with minimal involvement from the user.

*That* is going to be tricky, because it's a lot of information to encode in each object file/library/executable.

Ouch! I see that I gave the wrong idea about this.

I really need to be careful in expressing this point: I do *not* propose to encode *anything* within object files! I only desire for each tool in the tool chain to dump,
  >>> into a *new*, *separate*, plain-text file, <<<
information that it already knows about (like names of input files and names of output files).

Of course, if files get renamed (or moved across a network), that would make it harder to piece things back together. Hence the additional "want" about SHA values. Then a tool can just invoke:

   find some-dir -iname '*.config-disclosure'

... and with that *alone*, it could see how every stage of the build is connected to every other stage.
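
A sketch of how a tool might do that piecing-together, assuming (hypothetically) that each disclosure file is JSON with "inputs" and "outputs" lists whose entries carry a "name" and a "sha1". Stages are linked by hash rather than by name, so a rename between stages does not break the chain.

```python
import json
import os


def find_disclosures(root):
    """Yield paths of disclosure files under root (case-insensitive match)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".config-disclosure"):
                yield os.path.join(dirpath, name)


def load_records(paths):
    """Parse each disclosure file; the JSON layout is an assumption."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records


def link_stages(records):
    """Return (producer, consumer) index pairs joined by a shared hash."""
    producers = {}
    for idx, rec in enumerate(records):
        for out in rec.get("outputs", []):
            producers[out["sha1"]] = idx
    edges = []
    for idx, rec in enumerate(records):
        for inp in rec.get("inputs", []):
            src = producers.get(inp["sha1"])
            if src is not None and src != idx:
                edges.append((src, idx))
    return edges
```

Typical use would be link_stages(load_records(find_disclosures("some-dir"))), which yields the edges of the build graph.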

Better to start with the tools dumping/loading their configurations.

Right.

If each tool in the tool chain, when given the --disclose-config option, also sent info about its inputs and outputs to a text file, then this could be achieved.

Sure.

So for example, the linker would need only respond to "--disclose-config" by dumping (again, to a secondary file) its environment and the names of input files & output files. Renames could be detected and accounted for if SHA-1 values were passed along the way.

Other "wants" include:

  - dialect information for the TU

  - brief info on language extensions

  - SHA of the preprocessing token sequence

Keeping an unbroken chain of SHA values (a la git) could have other uses, like verifying that a specific build has been reproduced exactly (not counting certain things like different expansions of __DATE__ and __TIME__). This could be very handy when migrating between build systems or when a user is expected to deliver source code along with a program image.
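
One possible way to hash a preprocessing token sequence is sketched below. The function name and the length-prefix framing are assumptions, not part of the proposal, and the sketch does nothing about tokens that come from __DATE__ or __TIME__.

```python
import hashlib


def token_sequence_sha1(tokens):
    """Hash token spellings so that token boundaries affect the digest."""
    h = hashlib.sha1()
    for tok in tokens:
        data = tok.encode("utf-8")
        # Length-prefix each token so ["ab", "c"] and ["a", "bc"] differ.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()
```

Hashing spellings rather than raw source text means that whitespace and comment changes, which do not alter the token sequence, leave the digest unchanged.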

This feels like you're moving toward a full model of the steps it takes to compile a "project" (as an IDE or make system would know about).

Yes; that's where I would like things to go.

James Widman

Ack; forgot the other two bullets.

- The path name of the primary source file

How else would we know which .cpp file to parse?

Also, note that pre-defined macros, the #include search path (and generally, details about how the compiler is invoked) can vary from TU to TU. So in:

  clang --disclose-config -DBAR -c x.cpp -o foo.o

... the file foo.o.config-disclosure would reveal not just that clang was once invoked with a predefined "#define BAR 1", but that this invocation was used on x.cpp. Now, when a tool processes x.cpp, it's guaranteed to start with the same set of predefined macros that clang started with. The set---er, sorry; the *sequence* of predefined macros may differ for y.cpp.

- The working directory

You could easily have more than one "x.cpp" in a directory tree. One might then suggest that x.cpp's full pathname be output instead. And that's fine, but:

   - it could have an effect on how x.cpp is cited in diagnostics.

   - in the process of forming the full pathname, you first have to get the working directory anyway.

Also, -I options typically include things like -I. and -I../asdf.
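
To illustrate why the working directory matters for relative -I options, here is a small sketch that resolves search directories against a recorded working directory. The function name is hypothetical, POSIX-style paths are assumed, and ".." is collapsed lexically, which can give a different answer than the file system would when symlinks are involved.

```python
import posixpath


def resolve_include_dirs(working_directory, include_dirs):
    """Resolve each search directory against the recorded working directory."""
    return [
        posixpath.normpath(posixpath.join(working_directory, d))
        for d in include_dirs
    ]
```

For example, with a working directory of /home/u/proj/src, the options -I. and -I../asdf resolve to /home/u/proj/src and /home/u/proj/asdf, while an absolute directory is left alone.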

James Widman

So bullet points 1 & 3 boil down to something simpler (or at least more general):

    - The sequence of preprocessing directives implicitly processed (in effect) at the outset of each translation unit

James Widman

Incidentally, GCC does not seem to care about the order of -D and -include options; e.g. for a.h with:

#ifdef Bar
#error Bar defined!
#endif

... and the invocation:

g++ -DFoo -include a.h -DBar empty.cpp

... GCC hits the #error. (Ditto for clang.)

However, the updated (merged) bullet point still makes more sense: if anyone decides to start caring about the order of these things, the updated bullet point would cover it. And it's just clearer that way. So if/when GCC implements this, their disclosure file should contain:

#define Foo 1
#define Bar 1
#include "./a.h"
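
A sketch of how a driver could derive that sequence from argv, following the ordering observed above (all -D definitions first, then all -include files). The function name is made up, only -D and -include are handled, and the include path is emitted as typed rather than with a "./" prefix.

```python
def implicit_directives(argv):
    """Turn -D and -include options into an ordered list of directives."""
    defines, includes = [], []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-D" and i + 1 < len(argv):
            i += 1
            defines.append(argv[i])
        elif arg.startswith("-D") and len(arg) > 2:
            defines.append(arg[2:])
        elif arg == "-include" and i + 1 < len(argv):
            i += 1
            includes.append(argv[i])
        i += 1

    directives = []
    for definition in defines:
        name, sep, value = definition.partition("=")
        # A bare -DName defines the macro as 1.
        directives.append(f"#define {name} {value if sep else '1'}")
    # Implicit includes come after every command-line definition.
    directives.extend(f'#include "{path}"' for path in includes)
    return directives
```

A compiler that did respect the interleaving of -D and -include would simply emit the directives in argv order instead; the disclosure file format would not need to change.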

James Widman

I see something about this that might be viewed as a hiccup: --disclose-config needs to be passed on to 'clang -cc1', whose output is a file placed in $TMPDIR.

So from the perspective of 'clang -cc1', the natural name for the disclosure file would be something like:

   $TMPDIR/cc-F6CVkg.s.disclose-config

And programs are expected to clean up whatever they write to $TMPDIR, but disclosure files are meant to persist. (They're also meant to be visible to the user, and it probably helps the user interface if the filename is based on the output filename one expects when invoking the driver.)

So I can see a few ways to deal with this:

1) Just say that --disclose-config is just an experimental feature, and in this 0.1 version, we want to see what the output looks like before we start worrying too much about command-line interface details like this. It would work well enough for Clang developers, but it feels a little more half-baked than I would really like.

2) Have the driver deduce an appropriate filename for the disclosure file (as originally suggested above), and pass that name on to 'clang -cc1'. So even though '-cc1' is outputting an assembly file, the disclosure filename would be "foo.o.config-disclosure". It would work, but it feels wrong because '-cc1' never actually sees the machine code; it only sees the assembly code. So to mention 'foo.o' in the disclosure file feels too much like a fib. (And if we get around to outputting SHA values, it really won't work because '-cc1' will not be privy to the sequence of bytes of the object file.)

3) Let -cc1 make a "*.s.disclose-config" file (based on the name of the assembly file it produces); and make sure the driver removes it before it exits (unless -save-temps is given of course). But before removing the ".s.disclose-config" file, the driver should read it in, modify the "output file" portion (so that it now references the object file produced by the invocation of the assembler) and then write out a ".o.disclose-config" (where the name is based on the name of the actual object file).

Option (3) seems as close to "right" as we can get for the time being without modifying the assembler. If/when we can modify the assembler (or if/when '-cc1' just generates object code), we can then get rid of the part where the driver reads in the disclosure file.
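
Option (3) could look roughly like the sketch below. It assumes, purely for illustration, that the disclosure file is JSON with an "output_file" key; the function names and the default suffix are likewise assumptions.

```python
import json
import os


def retarget_output(record, object_file):
    """Return a copy of the record that names the object file as output."""
    updated = dict(record)
    updated["output_file"] = object_file
    return updated


def promote_disclosure(asm_disclosure_path, object_file,
                       save_temps=False, suffix=".disclose-config"):
    """Rewrite the assembly-stage disclosure as the object-stage one."""
    with open(asm_disclosure_path, encoding="utf-8") as f:
        record = json.load(f)
    final_path = object_file + suffix
    with open(final_path, "w", encoding="utf-8") as f:
        json.dump(retarget_output(record, object_file), f, indent=2)
    # The temporary disclosure only survives under -save-temps.
    if not save_temps:
        os.remove(asm_disclosure_path)
    return final_path
```

The driver would call promote_disclosure() once the assembler has produced the object file, which is also the earliest point at which a hash of the object file could be added.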

Also: I don't know what to do for the -pipe case. Granted, -pipe doesn't seem to make a difference for now, but presumably it will in the future. If nothing else, I guess the hash of a stage's output can be used as the base of a disclosure file's name.

James Widman

List of Needs:
--------------

- The complete set of predefined macro definitions;

- The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;

- The ordered sequence of implicit #include directives;

The ordering between predefined macro defs and implicit includes (e.g., based on -include on the command line) can matter.

So bullet points 1 & 3 boil down to something simpler (or at least more general):

   - The sequence of preprocessing directives implicitly processed (in effect) at the outset of each translation unit

I like that a lot.

   - Doug

I assume we would also need the reverse operation, taking a disclosure file and setting Clang's internal options to match the configuration described?

My motives were so self-serving that this did not occur to me. (:

I assumed that the driver of a 3rd-party tool would have its own disclosure-file-parser and set things up appropriately.

But for completeness, direct support for it in Clang would seem to make sense.

Yes. I'd hang it off CompilerInvocation, which encapsulates a... Compiler invocation.

If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.

If successful, one should be able to configure compiler A to build things like compiler B, just by dumping the disclosure file from one and importing it into the other, no?

Right.

If people see this as a desirable feature and reasonable to implement in Clang,

Seems reasonable to me.

Glad you think so!

I'll ask about implementation details in a separate thread.

then the next few questions are naturally raised:

1. What information would be output?
2. Where would it be output?
3. What format would be used in the output?

1) What information would be output?

At minimum, a tool *needs* the following information for each translation unit:

List of Needs:
--------------

- The complete set of predefined macro definitions;

- The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;

- The ordered sequence of implicit #include directives;

The ordering between predefined macro defs and implicit includes (e.g., based on -include on the command line) can matter.

Important tip; thanks!

- The path name of the primary source file;

- The working directory;

- Environment variables;

How would these affect translation ???

Well, some compilers depend on environment variables. E.g. MSVC uses the variable INCLUDE, which names directories containing headers that ship with the compiler. And GCC uses several environment variables; see the section 'ENVIRONMENT' in the GCC manual.

Granted, INCLUDE would already be covered in the second bullet point above. But it's always possible that the environment will contain something that alters a compiler's behavior in some unexpected way.

The environment is harder to control, which is unfortunate. I wonder if all of the environment variables have corresponding flags?

The whole point of this proposal is to bring to light essential (or potentially essential) details that have traditionally caused grief as a result of being hidden. So if a compiler relies on the environment then the environment is just another component of the configuration; therefore it's necessary in order to reproduce the compiler's behavior (and therefore appropriate to reveal in a config-disclosure file).

- The assumed encoding of the source file and the internal encoding used during processing;

- The name & version of the compiler;

- argv; and

- sizes of primitive types (except in the case where the main output is preprocessor output).

So those are all of the "needs" as I see them.

Here are some "wants":

It would be *very* helpful to automatically determine, for a given program image (or a dynamic library or a static archive), the full set of disclosed configurations (as in the "List of Needs" above), because then every tool would be able to completely configure itself for a given project with minimal involvement from the user.

*That* is going to be tricky, because it's a lot of information to encode in each object file/library/executable.

Ouch! I see that I gave the wrong idea about this.

I really need to be careful in expressing this point: I do *not* propose to encode *anything* within object files! I only desire for each tool in the tool chain to dump,

  >>> into a *new*, *separate*, plain-text file, <<<

information that it already knows about (like names of input files and names of output files).

Of course, if files get renamed (or moved across a network), that would make it harder to piece things back together. Hence the additional "want" about SHA values. Then a tool can just invoke:

  find some-dir -iname '*.config-disclosure'

... and with that *alone*, it could see how every stage of the build is connected to every other stage.

Ah, okay.

I assume we would also need the reverse operation, taking a disclosure file and setting Clang's internal options to match the configuration described?

My motives were so self-serving that this did not occur to me. (:

I assumed that the driver of a 3rd-party tool would have its own disclosure-file-parser and set things up appropriately.

But for completeness, direct support for it in Clang would seem to make sense.

Yes. I'd hang it off CompilerInvocation, which encapsulates a... Compiler invocation.

Makes sense.

[...]

- Environment variables;

How would these affect translation ???

Well, some compilers depend on environment variables. E.g. MSVC uses the variable INCLUDE, which names directories containing headers that ship with the compiler. And GCC uses several environment variables; see the section 'ENVIRONMENT' in the GCC manual.

Granted, INCLUDE would already be covered in the second bullet point above. But it's always possible that the environment will contain something that alters a compiler's behavior in some unexpected way.

The environment is harder to control, which is unfortunate. I wonder if all of the environment variables have corresponding flags?

Huh? On POSIX, just use the global variable:

  extern char **environ;

I think on Windows it's spelled '_environ'.

In either case, each array element points to a string of the form:

  name=value

... where 'name' does not contain '='. When serializing, we simply dump that out (into an XML element). When de-serializing, I guess there might be an option between a "merge" vs. "clobber": in merge, you simply do setenv() for each of the definitions given in the config-disclosure file; in clobber, you first unsetenv(name) (for each name in environ that is not in the disclosure file) and then you setenv(name,value,1) (for each name/value pair from the config-disclosure file).

Or for simplicity we just clobber, and if someone really needs "merge" for some purpose they can do it on their own.
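
The serialize and merge/clobber steps can be sketched as follows. The element and attribute names are invented, and the functions work on plain dictionaries rather than the live process environment so that the two policies are easy to compare. Values containing characters that XML cannot represent are not handled.

```python
import xml.etree.ElementTree as ET


def serialize_environment(env):
    """Dump name/value pairs as an <environment> XML element."""
    root = ET.Element("environment")
    for name in sorted(env):
        var = ET.SubElement(root, "variable", name=name)
        var.text = env[name]
    return ET.tostring(root, encoding="unicode")


def deserialize_environment(xml_text):
    """Read the pairs back; an empty element means an empty value."""
    root = ET.fromstring(xml_text)
    return {v.get("name"): v.text or "" for v in root.findall("variable")}


def apply_environment(current, disclosed, mode="clobber"):
    """Return the environment a re-run should see under the given policy."""
    if mode == "merge":
        # Keep existing variables; disclosed values win on conflict.
        result = dict(current)
        result.update(disclosed)
        return result
    if mode == "clobber":
        # Only the disclosed variables survive.
        return dict(disclosed)
    raise ValueError(f"unknown mode: {mode!r}")
```

The resulting dictionary could then be passed as the environment of a child process instead of mutating the current one.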

Did I miss something?

James Widman

I have some similar interests; perhaps we should talk offline. Send me your phone number (privately!) and I’ll give you a call if you want.

To test blocks, for example, I have a little hacked up tool that can build and test source language level tests; for each test the tool currently can:

  • compile a test using any of the {C, ObjC, C++, ObjC++} compiler variants as {32,64} bit binaries, using {-c99, {-fobjc-gc, -fobjc-gc-only, no-gc}, -O{1,2,3}} as basic configurations, and then test that the variation either {expectedCompileFailWithMessage, expectedWarningWithMessage, expectedCleanCompile} and if so then tests {expectedRunFatal, expectedRunClean, expectedRunWithWarning}… etc.
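
For a sense of scale, the basic configuration space described above can be enumerated as a cross product. The strings below are treated as opaque labels taken from the description, not as verified compiler flags, and the dictionary keys are made up.

```python
from itertools import product

LANGUAGES = ["C", "ObjC", "C++", "ObjC++"]
WORD_SIZES = [32, 64]
GC_MODES = ["-fobjc-gc", "-fobjc-gc-only", "no-gc"]
OPT_LEVELS = ["-O1", "-O2", "-O3"]


def configurations():
    """Yield one dictionary per point in the configuration space."""
    for lang, bits, gc, opt in product(LANGUAGES, WORD_SIZES,
                                       GC_MODES, OPT_LEVELS):
        yield {"language": lang, "bits": bits, "gc": gc, "opt": opt}
```

Each yielded dictionary would then be paired with the expected compile and run outcomes for a given test.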

For some tests I really need a 32-bit only system, I need to test against older releases of the OS, etc.

It's screaming for a BuildBotFarm.

As part of that, I will refine my existing configuration schema system so that it can drive clang from the outside, as in: here’s a spec, fork/exec/monitor clang with the options. I would love to drive it from the inside as well for other purposes, so I would love to discuss how to specify the “Configuration Space”. The LLVM/Clang OptionParser stuff is really cool; the goal would be to produce an option config from the spec that is equal to the command line, right?

I’m currently a fan of YAML as a portable representation format, but there is a tiny little configuration language that would drive into/out of the representation. Enough said. :)

Blaine