C++20 Modules, PCM files encoding implementation

bukebeyond · November 17, 2020, 12:50pm

Dear LLVM team,

I am new here, so please forgive me if I breach any established protocol.

I was referred to this mailing list to inquire about the direction of C++20 Modules, as we are considering migrating our codebase to the feature.

First let me express some appreciation for what your team is doing for the world of computer science with the outstanding and pioneering work on clang. We have successfully switched our codebase from MSVC to clang for its superior code generation and its support for the most modern features of C++. On average, we have doubled our compilation speed, tripled our execution speed, and halved the binary size. We have converted from old SFINAE hacks to C++20 Concepts, use direct builtins for modern instructions, and continue to marvel at the excellence of SIMD vectorization.

Our work is focused on exploring algorithmic frontiers in film and pro audio production. Artistic quality, precision and execution speed are paramount. This type of algorithm research and development typically demands fast iteration of the implementation code, often a few formulaic pages of DSP, rapidly changing to meet the speed and quality needs of the production team. The interface to consuming that code changes much less often.

Our question is about the current encoding of the pcm files generated from the module cppm files. We envision accelerating and simplifying development by converging most h files and cpp files into single module files. Currently, clang can compile a cppm file to a pcm file, to be consumed by the module importing code and the code editor enhancement clangd.

The implementation code is inside the module, inplace (not to be confused with the inline keyword for function inlining). The cppm file can be independently compiled to an object file and normally linked to produce function calls from consumers.

The current naive build systems can use the pcm file as a dependency, when the interface and layout of the classes change inside the module, to trigger efficient recompilations of the consuming code.

However, we have observed that the pcm file grows in size as the inplace implementation code grows. We expected the pcm to extract only the class interface and the memory layout, but that does not seem to be the case. Perhaps it would take more LLVM effort to extract and isolate that from the AST.

This of course has the unfortunate side effect of triggering redundant rebuilds of large portions of the codebase, making iteration times unacceptable compared with the older conventions. Those conventions of splitting .h and .cpp files involve repeating yourself, wasting developer focus on editing and managing two files at once, and often resorting to messy pimpl techniques that have to heap-allocate a backend and manage two references throughout the formulas, etc. Considering these overheads, there are hundreds of small and large plugins (modular effects), and their number keeps growing, that can benefit from convergence to modules as single, succinct files focused on clean formulation.

Are there any plans at LLVM for the pcm files to encode the interface only? Or are there any tools or functions you can recommend for extracting the module interface, so that it can signal the build system more efficiently?

Thank you for your time.

Sincerely,

dblaikie · November 17, 2020, 9:45pm

I don't think anyone's actively looking at this right now - perhaps
partly because there's still significant benefit to separating the
interface and implementation, even when using modules (no extraneous
rebuilds when you change the implementation - even if that rebuild
only rebuilds the interface and then you have a hash (rather than
timestamp) based build system that finds the interface to be identical
and so nothing else downstream is touched). Also at least with Clang's
model, I think the idea is to build the object file from the pcm
rather than from the cppm file. Though the possibility of having two
output files has some potential benefits, to be sure - I /think/ maybe
MSVC is doing something more like that two file model, but I don't
know for sure.
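To illustrate the hash-based (rather than timestamp-based) check mentioned above, here is a minimal sketch. The `HashCache` class and the use of FNV-1a are assumptions made for illustration only; they are not taken from any real build system. The point is just that an artifact which is regenerated with identical bytes does not count as changed, so nothing downstream needs rebuilding.

```cpp
#include <cstdint>
#include <map>
#include <string>

// FNV-1a 64-bit hash over a byte string (chosen here only for brevity).
std::uint64_t fnv1a(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ull;                  // FNV prime
    }
    return h;
}

// Hypothetical build-system state: last seen content hash per artifact.
class HashCache {
public:
    // Returns true if dependents must rebuild, i.e. the artifact's content
    // hash differs from the recorded one (or nothing was recorded yet).
    // The new hash is recorded either way.
    bool changed(const std::string& artifact, const std::string& contents) {
        const std::uint64_t h = fnv1a(contents);
        const auto it = seen_.find(artifact);
        const bool differs = (it == seen_.end()) || (it->second != h);
        seen_[artifact] = h;
        return differs;
    }

private:
    std::map<std::string, std::uint64_t> seen_;
};
```

With this, rebuilding an interface artifact after an implementation-only edit produces the same bytes, `changed` returns false, and importers are left alone. A timestamp check would have rebuilt them.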

zygoloid · November 17, 2020, 10:50pm

+Iain Sandoe, who has been looking at C++ Modules implementation issues in Clang.

I think it would make sense to be able to emit a “cut-down” pcm file that omits information that an importer of the module never needs (such as definitions of non-inline functions), in order to keep the file sizes smaller. However, that alone won’t be enough to avoid rebuilds when the .cppm files change – we also encode source location information in the pcm file that would be invalidated whenever implementation details change – or at least whenever they change size. In principle, there are techniques we could use here to avoid rebuilds when only those locations change, such as splitting the location information out into a separate file that is not listed as a dependency of downstream compilations (eg, according to -M), but that would need investigation.

Another promising idea that has not been investigated is the possibility of generating two different pcm’s for each module: one containing only cut-down ‘forward declaration’-level information (no class definitions, no inline function bodies, and so on), and one containing full information. The idea would be that we initially load only the cut-down version, and pull in the full information (and include the additional file as a dependency according to -M) only if the dependent compilation needs that information. Then we can avoid rebuilds if (for example) a class definition in a module interface changes but the consumers of that module interface didn’t actually use the class definition.

But I don’t think anyone has done any work to implement these approaches.
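The two-pcm idea above can be sketched from the importer's side as follows. This is a toy model only: the `TwoTierImport` class and the `.fwd.pcm` / `.full.pcm` file names are hypothetical and do not correspond to anything Clang produces today. It shows how the full file would become a recorded dependency only when a consumer actually needs definitions.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical importer-side view of one module that ships two artifacts:
// a cut-down "forward declaration" file and a full-information file.
class TwoTierImport {
public:
    explicit TwoTierImport(std::string name) : name_(std::move(name)) {
        // The cut-down file is always loaded, so it is always a dependency.
        deps_.insert(name_ + ".fwd.pcm");
    }

    // Naming a type (e.g. declaring a pointer to it) needs nothing beyond
    // the cut-down file, so no new dependency is recorded.
    void use_name() {}

    // Needing a class definition or an inline function body pulls in the
    // full file and records it as a dependency.
    void use_definition() { deps_.insert(name_ + ".full.pcm"); }

    // Files the build system should watch for this consumer, in the spirit
    // of what -M would report.
    const std::set<std::string>& dependencies() const { return deps_; }

private:
    std::string name_;
    std::set<std::string> deps_;
};
```

A consumer that only calls `use_name()` depends on the cut-down file alone, so a change to a class definition in the module would not force it to rebuild.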

bukebeyond · November 18, 2020, 8:28am

Thank you for your fine analysis Mr. Blaikie and Mr. Smith.

Perhaps the most practical and fastest-to-implement solution is to give clang a -generate-module-interface-hash option that takes a pcm file, scans the AST, skips the inplace implementation code, and outputs a simple hash.

Without such a facility, current build systems will be unable to build efficiently, further delaying the adoption of modules.
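As a rough illustration of what such an interface hash could look like, here is a sketch over a toy model. The `Decl` struct and `interface_hash` function are hypothetical; they do not reflect Clang's AST or the pcm layout, and no such clang option exists. The sketch assumes, as noted earlier in the thread, that bodies of non-inline functions are never needed by importers, while inline bodies are.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a declaration in a module; NOT Clang's AST.
struct Decl {
    std::string signature;  // what importers see, e.g. "float gain(float x)"
    std::string body;       // in-place implementation text
    bool is_inline;         // inline bodies are visible to importers
};

// FNV-1a 64-bit hash step: folds the bytes of s into the running hash h.
std::uint64_t fnv1a(std::uint64_t h, const std::string& s) {
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;  // FNV prime
    }
    return h;
}

// Hashes only what an importer can observe: every signature, plus the body
// of inline functions. Bodies of non-inline functions are skipped.
std::uint64_t interface_hash(const std::vector<Decl>& decls) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (const Decl& d : decls) {
        h = fnv1a(h, d.signature);
        h = fnv1a(h, "\n");  // separator between fields
        if (d.is_inline) {
            h = fnv1a(h, d.body);
            h = fnv1a(h, "\n");
        }
    }
    return h;
}
```

Editing the body of a non-inline function leaves the hash unchanged, whereas changing a signature or an inline body changes it. A build system could compare this value instead of the pcm file itself.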

The commonly cited benefits are that modules will resolve macro collisions, but those are neither very common nor very troublesome in practice, and that they will speed up compilation, but real-world tests show only a slight gain over earlier pch (precompiled header) techniques.

Thus, the ability of modules to modernize and clean up traditional C++ source code is perhaps their most significant potential. Not only can this accelerate future development for seasoned experts, giving them extra focus for more complex problems, but it will also make the language more attractive to a wider audience in math, science, and art fields.

My thoughts on all-in-one pcm files containing compiled machine instructions and hefty debug symbols are mixed. Currently, lldb demands full -gdwarf generation; otherwise it will skip over stepping into the module calls. Such heft in the pcm file may slow down the compilation of module-importing code. Certainly, there are some savings from parsing the module source and generating the AST just once; however, the current ability to compile the modules to object files separately and with varying release options seems like a simpler and more flexible design. Furthermore, how are the inline functions encoded in the pcm? Hopefully they are still in AST form, and not machine instruction form, so they can be better fused, vectorized and optimized at the usage site.

With these design considerations, it seems the pcm file encoding will be in a state of flux for a while, and that is fine. Even the clangd team is reporting crashes when loading pcm files with the slightest variation in version.

So before the pcm format reaches stability and standardisation one day, there needs to be something like the above hash facility to complete the build usability for the primetime adoption of modules.

Microsoft’s aging compiler is significantly behind clang at code generation. They will likely try to compensate for the gap with modern usability features like modules. Currently, Intellisense is failing at parsing Concepts, even with the recent update promising to do so. Intellisense is also having significant bugs and problems with modules. Clangd is almost there, although it is crashing on using templates from modules on Windows.

Looking forward to hearing your thoughts.

dblaikie · November 30, 2020, 10:59pm

> Thank you for your fine analysis Mr. Blaikie and Mr. Smith.
>
> Perhaps the most practical and fastest-to-implement solution is to give clang a -generate-module-interface-hash option that takes a pcm file, scans the AST, skips the inplace implementation code, and outputs a simple hash.

Sounds plausible - not sure how it'd compare in implementation
complexity compared to something that would emit an interface
precompiled module (Either from the original source (possibly while
generating the full precompiled module too) or from the full
precompiled module).

Though I'm not aware of anyone who's working on this/related things right now.

> Without such a facility, current build systems will be unable to build efficiently, further delaying the adoption of modules.
>
> The commonly cited benefits are that modules will resolve macro collisions, but those are neither very common nor very troublesome in practice, and that they will speed up compilation, but real-world tests show only a slight gain over earlier pch (precompiled header) techniques.
>
> Thus, the ability of modules to modernize and clean up traditional C++ source code is perhaps their most significant potential. Not only can this accelerate future development for seasoned experts, giving them extra focus for more complex problems, but it will also make the language more attractive to a wider audience in math, science, and art fields.
>
> My thoughts on all-in-one pcm files containing compiled machine instructions and hefty debug symbols are mixed. Currently, lldb demands full -gdwarf generation; otherwise it will skip over stepping into the module calls. Such heft in the pcm file may slow down the compilation of module-importing code.

Yep, there is prototype support, in an explicitly built module world
(& I think this is usable/used in the C++20 modules implementation),
for building an object file from a pcm separately - rather than
building the debug info into the object-file-as-pcm. With this
separate pcm->object step, that object contains not only DWARF but
also contains versions of functions defined in the module that can be
linked into users (so that every use of the module doesn't have to
carry comdat versions of inline functions that haven't been inlined -
instead they can rely on there being a definition available in the
module's object file). This approach also means that the debug info
and code generation for the module aren't a prerequisite for building
users of the pcm, increasing build parallelism, etc.

> Certainly, there are some savings from parsing the module source and generating the AST just once; however, the current ability to compile the modules to object files separately and with varying release options seems like a simpler and more flexible design. Furthermore, how are the inline functions encoded in the pcm? Hopefully they are still in AST form, and not machine instruction form, so they can be better fused, vectorized and optimized at the usage site.

Yes, in a pcm file it's the serialized AST only (except for the DWARF
stuff in the -gmodules thing on macOS), so compilation of uses of the
pcm has all the same opportunities for optimization it would have
with inline definitions in headers. (& even non-inline functions in a
module could be used for optimization at compile time if the
implementation chose to do so, I believe)

> With these design considerations, it seems the pcm file encoding will be in a state of flux for a while, and that is fine. Even the clangd team is reporting crashes when loading pcm files with the slightest variation in version.
>
> So before the pcm format reaches stability and standardisation one day,

This, to the best of my knowledge, is a non-goal. Clang isn't
attempting to stabilize or standardize the pcm format.
