[RFC] Generate CC1 Command Lines from a CompilerInvocation instance

I am a PhD student at Imperial College London currently contracting for Apple, working with Michael Spencer and Duncan Exon Smith on providing build system support for explicit module builds through clang-scan-deps. As part of clang-scan-deps we need to provide build systems with a dependency graph of the modules needed for a given build, as well as a command line that can build each needed module explicitly. Under the hood, clang-scan-deps uses the implicit module build mechanisms to discover modular dependencies. In this setup, when we need to generate a command line to build an explicit module, we take the command line associated with the TU that discovered the given module and append to it “-remove-preceeding-explicit-module-build-incompatible-options” followed by the options needed to build the module explicitly. The trouble with this is that the command line we generate for building an explicit module is not deterministic: it depends on which TU discovers the module first. Furthermore, this scheme makes it extremely difficult to remove options that are irrelevant to the compilation of the module. What we really want is to be able to generate a cc1 command line from first principles, given a CompilerInvocation instance.

To achieve this goal, we need a strategy that allows us to synthesize command line arguments from the various fields of a CompilerInvocation instance. Unfortunately, CompilerInvocation does not currently support this use case. We propose to automate the process of generating a command line by embedding information inside the TableGen option description files. This extra information will define the mapping between a given compiler option and the fields of CompilerInvocation using a mechanism similar to Obj-C’s keypaths. We can then use this to generate code that automatically translates from a given option to the correct value of the associated CompilerInvocation field and back. A patch describing the proposed pattern is available at https://reviews.llvm.org/D79796. Of course, this scheme cannot work for complicated mapping logic, but we feel that it can automate generating command line arguments and parsing options for the vast majority of cases, which would significantly reduce the amount of custom code inside CompilerInvocation. To make sure that developers keep the mappings up to date, we plan to add an assertion at the end of CompilerInvocation::CreateFromArgs that checks that we can round-trip the CompilerInvocation through CompilerInvocation::GenerateCC1CommandLine. This should flag failures in debug builds, making developers aware of the need to keep the mappings up to date without adding runtime overhead to release builds of clang.
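To make the round-trip check concrete, here is a minimal sketch on a toy model. Everything in it is hypothetical and only for illustration: ToyInvocation stands in for CompilerInvocation, and createFromArgs, generateCC1CommandLine and roundTrips are simplified stand-ins, not the signatures used in the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a tiny slice of CompilerInvocation.
struct ToyHeaderSearchOpts {
  bool ModulesStrictContextHash = false;
};
struct ToyInvocation {
  ToyHeaderSearchOpts HeaderSearchOpts;
};

using ArgList = std::vector<std::string>;

// Returns true if Flag appears anywhere in Args.
static bool hasArg(const ArgList &Args, const std::string &Flag) {
  for (const std::string &A : Args)
    if (A == Flag)
      return true;
  return false;
}

// Parsing direction: command line -> invocation state.
ToyInvocation createFromArgs(const ArgList &Args) {
  ToyInvocation Res;
  Res.HeaderSearchOpts.ModulesStrictContextHash =
      hasArg(Args, "-fmodules-strict-context-hash");
  return Res;
}

// Generating direction: invocation state -> command line.
ArgList generateCC1CommandLine(const ToyInvocation &CI) {
  ArgList Args;
  if (CI.HeaderSearchOpts.ModulesStrictContextHash)
    Args.push_back("-fmodules-strict-context-hash");
  return Args;
}

// The check an assertion could perform in debug builds: parsing the
// regenerated command line must reproduce the same state.
bool roundTrips(const ArgList &Args) {
  ToyInvocation First = createFromArgs(Args);
  ToyInvocation Second = createFromArgs(generateCC1CommandLine(First));
  return First.HeaderSearchOpts.ModulesStrictContextHash ==
         Second.HeaderSearchOpts.ModulesStrictContextHash;
}
```

In this toy version, a field that is parsed but never emitted by the generator would make roundTrips return false for any command line that sets it, which is the kind of stale mapping the assertion is meant to catch.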

Thanks for reading and looking forward to everyone’s comments!

Daniel Grumberg

Broadly speaking, that sounds great. The cc1 interface is unstable, and we can change it to make the mapping from internal state to command line and back simpler. The driver code would benefit much more from this kind of refactoring, but the driver flag handling is so idiosyncratic that this isn’t really feasible.

My input on working with TableGen is: try to generate as little C++ code as humanly possible. Please try hard to generate read-only data instead of code. It compiles much more quickly, and usually runs just as quickly.
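As a rough illustration of what "data instead of code" could look like for simple boolean flags, here is a sketch that uses a constant table of pointers to members. It assumes a flat toy struct; ToyOpts, FlagMapping, FlagTable, parseFlags and emitFlags are hypothetical names, and the real CompilerInvocation nests its options in several sub-objects, which is where this gets harder.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flat options struct; not the real Clang layout.
struct ToyOpts {
  bool ModulesStrictContextHash = false;
  bool ImplicitModuleMaps = false;
};

// One row of read-only data per flag: its spelling and the field it sets.
struct FlagMapping {
  const char *Spelling;
  bool ToyOpts::*Field;
};

constexpr FlagMapping FlagTable[] = {
    {"-fmodules-strict-context-hash", &ToyOpts::ModulesStrictContextHash},
    {"-fimplicit-module-maps", &ToyOpts::ImplicitModuleMaps},
};

// A single generic loop handles parsing for every row in the table.
ToyOpts parseFlags(const std::vector<std::string> &Args) {
  ToyOpts Opts;
  for (const std::string &A : Args)
    for (const FlagMapping &M : FlagTable)
      if (A == M.Spelling)
        Opts.*M.Field = true;
  return Opts;
}

// The same table drives command line generation, in table order.
std::vector<std::string> emitFlags(const ToyOpts &Opts) {
  std::vector<std::string> Args;
  for (const FlagMapping &M : FlagTable)
    if (Opts.*M.Field)
      Args.push_back(M.Spelling);
  return Args;
}
```

Adding a flag here means adding one table row rather than new parsing and generation code, so the amount of code to compile stays constant as the table grows.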

Our codebase has a bit of a legacy of table generated code gone wrong, and I want to move away from that where possible. See these (stale) stats on the individual functions that take the longest to compile:
https://reviews.llvm.org/P8191$95 llvm::decodeToMCInst
They are generated here: https://github.com/llvm/llvm-project/blob/master/llvm/utils/TableGen/FixedLenDecoderEmitter.cpp#L964
I have tried to optimize the code to compile faster, but so far I’ve had no luck.

This sounds like a great project to me.

There is already at least one place in Clang that wishes it could do this: when building with -E -frewrite-imports, we build a source file that embeds sources that can be used to transitively rebuild dependency modules. (The dependency modules are expressed via pragmas.) But we can’t faithfully represent the options that were used to build those modules, because there’s no way to serialize the options from an arbitrary AST file as text. This matters in particular for an explicit modules build, where the set of flags used to build imported modules may differ from the set of flags used to build the consumer of that module.

We thought about generating read-only tables by embedding offsets into the different levels of internal state (i.e. generating a table with a series of member pointers into the various classes), but we felt that it would add more complexity than it is worth. Ultimately, all the options have some code that checks whether the appropriate command line flag(s) were given. For example, the code we removed for -fmodules-strict-context-hash is:

Opts.ModulesStrictContextHash = Args.hasArg(OPT_fmodules_strict_context_hash);

with some earlier code defining Opts as a reference to the HeaderSearchOpts field of CompilerInvocation.
Once expanded, the code our macro generates for parsing the argument would look something like:

if (Option::FlagClass == Option::FlagClass)
  Res.HeaderSearchOpts->ModulesStrictContextHash = Args.hasArg(OPT_fmodules_strict_context_hash) && true;

The code for generating the cc1 command line looks very similar. You do have to go through the preprocessor to get there, but this is still significantly cheaper than doing template instantiations like in the example you linked, if I understand the whole thing correctly. Let me know if we have missed something and there is a way of encoding an arbitrary keypath in a read-only table without too much pain.
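For readers who want to see how one option list can drive both directions through the preprocessor, here is a simplified sketch. It is not the macro from the patch: SKETCH_FLAG_OPTIONS, the Sketch* types and the two functions are hypothetical, and the hand-written list stands in for what TableGen would emit.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical nested state, loosely shaped like CompilerInvocation.
struct SketchHeaderSearchOpts { bool ModulesStrictContextHash = false; };
struct SketchCodeGenOpts { bool DebugPassManager = false; };
struct SketchInvocation {
  SketchHeaderSearchOpts HeaderSearchOpts;
  SketchCodeGenOpts CodeGenOpts;
};

// One entry per option: the flag spelling and the keypath of its field.
#define SKETCH_FLAG_OPTIONS(X)                                                 \
  X("-fmodules-strict-context-hash", HeaderSearchOpts.ModulesStrictContextHash) \
  X("-fdebug-pass-manager", CodeGenOpts.DebugPassManager)

static bool sketchHasArg(const std::vector<std::string> &Args,
                         const char *Flag) {
  for (const std::string &A : Args)
    if (A == Flag)
      return true;
  return false;
}

// Parsing: each entry expands to one assignment through its keypath.
SketchInvocation sketchParse(const std::vector<std::string> &Args) {
  SketchInvocation Res;
#define SKETCH_PARSE(SPELLING, KEYPATH) Res.KEYPATH = sketchHasArg(Args, SPELLING);
  SKETCH_FLAG_OPTIONS(SKETCH_PARSE)
#undef SKETCH_PARSE
  return Res;
}

// Generation: the same entries expand to one conditional push_back each.
std::vector<std::string> sketchGenerate(const SketchInvocation &CI) {
  std::vector<std::string> Args;
#define SKETCH_GENERATE(SPELLING, KEYPATH)                                     \
  if (CI.KEYPATH)                                                              \
    Args.push_back(SPELLING);
  SKETCH_FLAG_OPTIONS(SKETCH_GENERATE)
#undef SKETCH_GENERATE
  return Args;
}
```

Because the keypath is spliced in as plain tokens, it can reach through any number of nested members without templates or offset arithmetic, at the cost of emitting one statement per option.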


Also, I added you as a reviewer on the first patch (https://reviews.llvm.org/D79796) so that you can raise concrete concerns with our approach over there.