[RFC] [C++20] [Modules] Introduce Thin BMI and Decls hash

But that requires the build system to read previous output files produced by the compiler and depend on them as inputs to the next build? (Not dissimilar to the compare-and-swap approach, but that’s usually handled by the build system, not the compiler itself - having the compiler read its previous outputs would be a pretty significant departure/tool change.)

And it only applies when the previous build succeeded - and there wouldn’t be a way for the build system to know the previous build succeeded, right? Maybe these objects are stale? (I guess that’d work out OK if we built successfully, then the build failed, and the next build happened to be a hash hit - that means we changed the code back to something semantically equivalent.)

There are a few complications here - if the two copies of func get optimized differently, the linker will pick one and tombstone/zero-out the addresses in the debug info that references the removed copy (because the debug info could describe variables as being in different locations, etc).

If the machine code/.text section for the functions is identical, then the debug info in both cases refers to that one copy - so you’d have one copy in one DWARF CU that refers to the function and says it’s on line 5, and another that refers to the same function and says it’s on line 7. This isn’t particularly useful/necessary, but isn’t wrong either.

It’d still give a poor user experience if the DWARF was out of date and you couldn’t break on the function and get to the right line of source, or couldn’t step through it and see the right lines of code, etc.

Possible, but probably a poor user experience - the user wouldn’t be able to step into this code, print variables, etc. That seems pretty unfortunate.

Summary for the decls semantic hash:

  1. The compare-and-swap approach is not good, since the source locations in diagnostic messages may be out of date.
  2. Option #2 (skip a compilation job if its content and the required hash values don’t change) requires the build system to know the status of the previous build. Will this be possible? CC: @ben.boeckel
  3. (Based on option #2) For inline functions from modules, users may get a poor debugging experience. But it isn’t wrong.
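
To make point 2 concrete, here is a minimal shell sketch of how a build driver (not the compiler) might record that the previous job succeeded. The helpers `job_key` and `run_if_changed`, the stamp file, and the idea that each import exposes its decls hash as a plain string are all assumptions for illustration; no such interface exists in Clang or CMake today. It also assumes `sha256sum` from GNU coreutils is available.

```shell
#!/bin/sh
# Hypothetical helper: derive a key from the source file content plus the
# decls hashes of its imports (passed as plain strings).
# usage: job_key <source> [<import-hash>...]
job_key() {
    src=$1; shift
    { cat "$src"; printf '%s\n' "$@"; } | sha256sum | cut -d' ' -f1
}

# Hypothetical helper: run the compile command only if the key changed.
# The stamp is written only after the command succeeds, so a failed build
# never leaves a stamp claiming that the outputs are up to date.
# usage: run_if_changed <stamp> <key> <command...>
run_if_changed() {
    stamp=$1; key=$2; shift 2
    if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$key" ]; then
        echo "skip"
        return 0
    fi
    rm -f "$stamp"
    "$@" || return 1
    printf '%s\n' "$key" > "$stamp"
    echo "built"
}
```

The open question from point 2 remains: this only works if something outside the compiler owns the stamp and the build system agrees to consult it.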

To me, point 2 looks like it may be the blocker.


As for the naming of thin/fat BMI, I still prefer implementation/interface BMI since it is easy to understand.

[apologies for not responding sooner, WG21 and associated commitments are making it hard]

At the moment my main problem with this thread is that it discusses a proposed solution without making clear what the underlying requirement is. If we make sure to specify the requirement clearly, then it might assist in identifying other potential solutions.

Elsewhere, I have proposed that we use an AST multiplex consumer to split the paths between the code gen (requires “full” content) and the BMI (interface-only) output, with the intention of applying AST transforms on the second path to implement the ‘thinning’.

With that style of production (which mimics what GCC, at least, does) there is no need to try and track an intermediate form.

To this end I would suggest there are actually two requirements being discussed here:

  1. That interface BMIs need to contain reduced content compared to the AST that is used to generate the object files. This is both for correctness (e.g. non-inline function bodies should not be present in the interfaces) and for performance (e.g. unreferenced GMF decls should be elided from the interface to minimise merging effort in the consumer). These are examples, not a complete list.

  2. That it is desirable that build systems have a mechanism to avoid rebuilding interface BMIs when it is not necessary.

As noted above, some implementations of (1) might obviate the need for (2) - although some simple mechanism for determining the unchanged content could still be valuable.

Does this synopsis seem correct? If not, what additional requirements or changes should we note?

Elsewhere, I have proposed that we use an AST multiplex consumer to split the paths between the code gen (requires “full” content) and the BMI (interface-only) output, with the intention of applying AST transforms on the second path to implement the ‘thinning’.

I feel like this is an implementation detail. Or rather, no one is against “pruning” the BMI if possible.

With that style of production (which mimics what GCC, at least, does) there is no need to try and track an intermediate form.

Maybe at the end of the day Clang will do that. But given that the current implementation is relatively stable, it is not good to change this too quickly. This is described in the road map section. So the current implementation BMI will exist for a relatively long time.

some implementations of (1) might obviate the need for (2)

No. The most discussed thing in the post is source locations. As long as we put source locations into the interface BMI, we face the same problem discussed above. And I guess no one is against putting source locations into the interface BMI, right? (Otherwise how can we generate diagnostic messages and debug info?)


BTW, how is this handled in GCC? Does GCC put source locations into its CMI? And does GCC try to avoid recompiling when only the source locations in a module’s source have changed?

For content-hash-based executions, yes. CMake doesn’t support any today. They can be emulated with smarter “overwrite if different” tools to only conditionally update the mtime in conjunction with ninja’s restat = 1 feature.
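
The “overwrite if different” emulation can be sketched in shell roughly as follows. The helper name `update_if_changed` is made up for illustration, and the scheme relies on the build tool re-checking the output’s mtime after the command has run (which is what ninja does for rules with `restat = 1`).

```shell
#!/bin/sh
# Hypothetical helper: install a freshly built file only if its content
# differs from the existing output, so an unchanged output keeps its mtime.
# usage: update_if_changed <new-file> <output>
update_if_changed() {
    new=$1; out=$2
    if [ -f "$out" ] && cmp -s "$new" "$out"; then
        rm -f "$new"      # identical: leave the old file (and its mtime) alone
        echo "unchanged"
    else
        mv -f "$new" "$out"
        echo "updated"
    fi
}
```

A rule would then write the compiler output to a temporary name and call the helper on it. This only helps when the output is byte-for-byte identical.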

I’m worried about people getting confused between interface/implementation BMIs and the existence of module interface and implementation units. And as previously mentioned, the “I” in “BMI” already stands for “Interface”. “Implementation” also seems a little strange to me; the property that is interesting is the ability to perform codegen from the (complete) BMI. We use .ast files for this ability elsewhere today.

I still prefer “minimal” and “complete” if qualifiers are really needed. This is all still new enough that I think backward compatibility isn’t a particularly large concern at this point. Of course, anyone that feels differently should refute that, preferably with data or experience.

So do I understand correctly that this is impossible for CMake, at least for now?

I wonder how many people have data or experience in this field. To me, interface/implementation BMI explains the functionality, while minimal/complete only describes a property, and I still need to look something up to understand the goal and rationale. Maybe because I am not a native speaker, the redundant meaning of the “I” in BMI doesn’t bother me. I guess we can only reach a consensus on this by a poll at the end of the day : )

The underlying tools don’t support it widely, so it’s not a guarantee CMake can offer.

I think it’s premature at this point to attempt to invent a solution to source-locations which necessarily appear in BMI outputs. I recommend putting that issue aside (for now) and NOT introducing a decls hash mechanism, only continuing with the “Thin BMIs” work.

However, I do think it would be worthwhile to look into reducing the impact of BMI changes on transitive dependencies.

Currently, a no-op change to a module affects not only its own BMI output file, but also everything transitively using it. Everything up the chain must be rebuilt, even assuming a build-system or compiler-cache which uses content hashes (of which many do exist).

E.g. given:

cat > test.cppm <<EOF
export module test;
export int test() {
    return 0;
}
EOF

cat > test2.cppm <<EOF
export module test2;
import test;

export int test2() {
    return test();
}
EOF

cat > main.cc <<EOF
import test2;
int main() { return test2(); }
EOF

One might hope that main.cc could be compiled without needing to load the test module at all, and thus, be insensitive to changes in it. After all: nothing from test is ever part of an exported decl of the imported test2 module. Alas, this is not the case – you do need to provide both test and test2 to compilations of main.cc. E.g. with:

clang -std=c++20 test.cppm --precompile -o test.pcm
clang -std=c++20 test2.cppm --precompile -o test2.pcm -fmodule-file=test=test.pcm
clang -std=c++20 -c main.cc -o main.o -fmodule-file=test=test.pcm -fmodule-file=test2=test2.pcm

Maybe the unnecessary import can be fixed as well via the “thin BMI” effort?

Furthermore, the contents of the file test2.pcm actually differ with trivial modifications to test.cppm. E.g.

mv test.pcm test-old.pcm
mv test2.pcm test2-old.pcm

echo >> test.cppm

clang -std=c++20 test.cppm --precompile -o test.pcm
clang -std=c++20 test2.cppm --precompile -o test2.pcm -fmodule-file=test=test.pcm

Sure, test-old.pcm and test.pcm differ here, but test2-old.pcm and test2.pcm seem like they really should not differ! Looking into why, with llvm-bcanalyzer --show-binary-blobs -dump they differ only in the field source-location-offset of MODULE_OFFSET_MAP. That seems like purely an internal implementation detail of how Clang’s precompiled-header code was originally built…and probably could be fixed.

I think it’s premature at this point to attempt to invent a solution to source-locations which necessarily appear in BMI outputs. I recommend putting that issue aside (for now) and NOT introducing a decls hash mechanism, only continuing with the “Thin BMIs” work.

Yes, agreed. And if you’re interested, there is additional discussion on the SG15 mailing list.

Maybe the unnecessary import can be fixed as well via the “thin BMI” effort?

For this specific example, probably yes. However, the story changes when we think about templates. As long as a template in module test2 uses a template from module test, the template from module test becomes a true dependency for users of module test2. Then we can’t ignore module test for main.cc.
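
A variation of the earlier example, in the same style, illustrates the template case. The function template `twice` is a name invented for this sketch.

```shell
cat > test.cppm <<EOF
export module test;
export template <class T>
T twice(T v) { return v + v; }
EOF

cat > test2.cppm <<EOF
export module test2;
import test;

export template <class T>
T test2(T v) { return twice(v); }
EOF

cat > main.cc <<EOF
import test2;
// test2<int> is instantiated here, which in turn instantiates twice<int>.
int main() { return test2(0); }
EOF
```

Here the body of `twice` has to be available when main.cc is compiled, because that is where `test2<int>` gets instantiated. So no amount of thinning of test2’s BMI can make main.cc independent of module test in this case.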

Now my thought is: templates (and inline functions) are the devil for isolation… and possibly we can’t get rid of that.

I think these concerns also apply to inline variables (and constexpr variables in general?). Also data member initializers and defaulted default constructors (which I guess is just a specific case of an inline function).

Yes! Sorry for the confusion.

  • thin/fat BMI
  • interface/implementation BMI
  • BMInterface/BMImplementation

A poll for naming conventions.

“fat” can be interpreted negatively. I voted for the thin/fat option, but would prefer something like thin/full, thin/complete, or thin/wide.

Hmm… good point, thin and fat are both probably non-inclusive language and something like ‘reduced/complete’/etc is probably necessary.

I voted for interface/implementation, but reduced/complete sounds fine to me.

Oh, nice suggestion. Let’s make new polls that include the suggestions. Let’s vote for ‘thin BMI’ and ‘fat BMI’ separately.

  • thin BMI
  • reduced BMI
  • interface BMI
  • BMIInterface

  • fat BMI
  • full BMI
  • complete BMI
  • wide BMI
  • implementation BMI
  • BMImplementation

I hope we won’t end up with an incoherent pair of terms by voting this way.
If we do, I suggest having another poll, where the winners are used to form a coherent pair.

Or this vote can be redone with new pairs and multiple votes per person.
