This post is for higher-level discussion of [C++20] [Modules] Introduce thin BMI by ChuanqiXu9 · Pull Request #71622 · llvm/llvm-project · GitHub and [C++20] [Modules] Bring Decls Hash to BMI for C++20 Module units by ChuanqiXu9 · Pull Request #71627 · llvm/llvm-project · GitHub. Both PRs are large, and I think people may have higher-level comments first. It would not look good to mix higher-level comments with low-level comments on details in the PRs themselves, and posting here should also give the discussion better visibility.
The post covers both thin BMI and decls hash, but the two sections can be read independently.
Thin BMI
Let’s call the current BMI for named modules the fat BMI, since it contains all the information needed to compile the BMI itself to an object file. This is the foundation of two-phase compilation.
(There are differing opinions about the terminology in [C++20] [Modules] Introduce thin BMI by ChuanqiXu9 · Pull Request #71622 · llvm/llvm-project · GitHub.)
However, this is a waste of time and space if we’re using one-phase compilation, especially since both GCC and MSVC chose to implement one-phase compilation.
Then let’s call a BMI that doesn’t contain all the information needed to produce an object file, but that still satisfies the requirements of the module’s consumers, the thin BMI. It is then clear that we should use the thin BMI rather than the fat BMI in one-phase compilation.
The thin BMI may also be helpful in two-phase compilation if the build system wants to use a compare-and-swap strategy to generate the BMI based on the BMIDeclsHash.
Interfaces
The designed interface is the -fthinBMI-output= flag, which specifies the location of the thin BMI. It should be used when compiling a single module unit.
This design also makes it possible to use the thin BMI in two-phase compilation. There, we’ll generate two BMIs: a thin BMI to be used by consumers and a fat BMI for compiling the module unit itself to an object file. It may sound confusing to have two BMIs for one module unit, but only the thin BMI is the BMI we generally talk about; the fat BMI is only visible to the module unit itself.
With one-phase compilation, the behavior of -fthinBMI-output= is very similar to that of -fmodule-output=, except that one generates the thin BMI and the other generates the fat BMI. The design here is based on two things:
(1) The serialization of C++ is pretty complex. We can’t be sure we’re handling every detail correctly at the very beginning.
(2) The fat BMI is relatively widely used and relatively stable, so it does not seem wise to replace the fat BMI with the thin BMI immediately.
But, of course, at the end of the day, we want consumers to use the thin BMI only. When that day comes, -fmodule-output= will become an alias for -fthinBMI-output=.
Another design choice is to reuse -fmodule-output= and introduce a flag -femit-thin-BMI. Then -femit-thin-BMI -fmodule-output= would have the same effect that -fthinBMI-output= has now.
The flag -femit-thin-BMI would be opt-in at first, opt-out later, and finally deprecated.
Roadmap
The roadmap I have in mind for thin BMI is:
(1) In clang18, release thin BMI and mark it as experimental. Also encourage users and build systems to try this new mode.
(2) In clang19 or clang20 (depending on feedback from issue reports), remove the experimental mark from thin BMI and mark the use of fat BMI by consumers as deprecated.
(3) In clang21 or clang22, emit an error if we find that users are trying to import a fat BMI.
Decls Hash
Motivating Example
The motivating example is:
// a.cppm
export module a;
export int a() {
    return 43;
}
// use.cc
import a;
int use() {
    return a();
}
After we change the implementation of a() from return 43; to return 44;, we can avoid recompiling use.cc to use.o since the interface doesn’t change.
This is a pretty appealing feature.
Interfaces
The interface introduced in this patch is:
- Every BMI for a C++20 named module will contain a hash value recording all the parts of the decls that may affect consumers.
- Users (generally build systems) can query the information with the -module-file-info flag or the -get-bmi-decls-hash flag.
For example,
$ clang++ -module-file-info Hello.pcm
Information for module file 'Hello.pcm':
Module format: raw
====== C++20 Module structure ======
Decls Hash: e414edc5d1e1c721
Interface Unit 'Hello' is the Primary Module at index #1
Sub Modules:
Global Module Fragment '<global>' is at index #2
====== ======
Generated by this Clang: (git@github.com:llvm/llvm-project.git 561b8a1ac2a94761a9bf190c6ad2b8785ce9e072)
Module name: Hello
Language options:
C99: No
C11: No
C17: No
C23: No
Microsoft Visual C++ full compatibility mode: No
Kernel mode: No
Microsoft C++ extensions: No
Microsoft inline asm blocks: No
Borland extensions: No
C++: Yes
C++11: Yes
C++14: Yes
C++17: Yes
C++20: Yes
C++23: No
C++26: No
...
or
$ clang++ -get-bmi-decls-hash Hello.pcm
Decls Hash: e414edc5d1e1c721
The difference is that -module-file-info provides more information, such as the compiler version and compilation flags, but may be slower, while -get-bmi-decls-hash focuses on the decls hash only and should be faster.
Note that, as its name suggests, the hash value doesn’t cover information like the compiler version and compilation flags. The thought is that the build system should already know both.
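On the build-system side, reading the value back could look like the following minimal sketch. It assumes the tool output contains a line of the form "Decls Hash: <hex digits>", as in the two examples above. The function name parseDeclsHash is made up for illustration and is not part of Clang or of the patch.

```cpp
#include <string>

// Hypothetical helper (not part of Clang): extract the hex value from a
// "Decls Hash: <hex digits>" line in the captured tool output.
// Returns an empty string if no such line is found.
std::string parseDeclsHash(const std::string &output) {
  const std::string key = "Decls Hash:";
  const std::string::size_type pos = output.find(key);
  if (pos == std::string::npos)
    return "";
  // Skip the blanks between the key and the value.
  const std::string::size_type begin =
      output.find_first_not_of(" \t", pos + key.size());
  if (begin == std::string::npos)
    return "";
  // The value ends at the first character that is not a hex digit.
  const std::string::size_type end =
      output.find_first_not_of("0123456789abcdefABCDEF", begin);
  return output.substr(
      begin, end == std::string::npos ? std::string::npos : end - begin);
}
```

The same function works on the output of either flag, since it only looks for the "Decls Hash:" line and ignores everything else.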
Usages
The decls hash allows build systems to use a compare-and-swap strategy when generating a BMI. For example, the build system can generate the BMI in a temporary location and, if the compiler version and compilation flags haven’t changed, query and compare the decls hash values of the newly generated BMI and the existing BMI. It then replaces the existing BMI only if the hash values differ.
With this, the motivating example should work: use.cc does not need to be recompiled when only the body of a() changes.
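The decision step of that strategy can be sketched as below. This is only an illustration of the logic described above, not code from the patch or from any build system. The names mustReplaceBMI and publishBMI are hypothetical, and the sketch assumes the two hash strings were obtained beforehand (for example with -get-bmi-decls-hash), with an empty string meaning the hash could not be queried.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: decide whether the freshly generated BMI has to
// replace the BMI that consumers currently see.
bool mustReplaceBMI(const std::string &existingHash,
                    const std::string &newHash,
                    bool compilerOrFlagsChanged) {
  // The decls hash does not cover the compiler version or the flags, so
  // any change there forces a replacement regardless of the hash values.
  if (compilerOrFlagsChanged)
    return true;
  // No usable hash on either side: be conservative and replace.
  if (existingHash.empty() || newHash.empty())
    return true;
  return existingHash != newHash;
}

// Hypothetical helper: publish or discard the temporary BMI.
// Returns true if the BMI at bmiPath was replaced.
bool publishBMI(const std::string &tmpPath, const std::string &bmiPath,
                bool replace) {
  if (!replace) {
    // Keep the existing BMI (and its timestamp) so that consumers are
    // still considered up to date, and drop the temporary file.
    std::remove(tmpPath.c_str());
    return false;
  }
  // On POSIX systems rename() replaces an existing destination; other
  // platforms may need to remove the destination first.
  return std::rename(tmpPath.c_str(), bmiPath.c_str()) == 0;
}
```

Leaving the existing BMI untouched when the hashes match is what lets a timestamp-based build system skip rebuilding the consumers.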
Why not generate identical BMIs directly
People may feel the process is complicated. Why not generate identical BMIs in the first place? Then we could simply use diff or md5sum. The answer is that we can’t. We put too many things into the BMI, and we serialize the source file as is. The simplest example is that the generated BMI is different for:
export module a;
export int a() { return 43; }
and
export module a;
export int a() { return 43; }
The two units differ only in layout, yet their BMIs differ, just because the function a() has a different source location.
The strategies of computing decls hash
Currently the strategy is simple: we don’t count the definition of a non-inline, non-dependent function or variable, and we don’t count non-exported, non-inline, non-dependent entities at all.
For example,
export module a;
export int a() { return 43; }
and
export module a;
int unexported() { return 44; }
export int a() { return unexported(); }
will have the same decls hash value.
More examples can be found in [C++20] [Modules] Bring Decls Hash to BMI for C++20 Module units by ChuanqiXu9 · Pull Request #71627 · llvm/llvm-project · GitHub.
An interesting case is inline functions. The ABI says that inline functions are emitted in the units that call them, so we choose to update the decls hash every time we find that an inline function has changed. But this strategy may be too conservative. For example,
export module a;
inline int func() { return 43; }
export int a() { return func(); }
and
export module a;
inline int func() { return 44; }
export int a() { return func(); }
These two units could have the same decls hash, since the inline function func() is not reachable from consumers. But we choose to generate different decls hashes to keep the implementation simple. Technically, it is possible to handle this by implementing a context-sensitive reachability analysis, but let’s leave that to the future.
The key point here is how we treat such cases. I think:
- It is a bug if we produce the same decls hash value for two different module units that are not ABI compatible for consumers.
- But it is only a missed opportunity for improvement if we produce different decls hash values for module units that are ABI compatible for consumers.
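To make the counting rules in this section concrete, here is a toy model. It is not how Clang computes the hash, which works on the serialized AST. ToyDecl and countedInterface are hypothetical names, and the model returns the text that would be fed into a hash rather than the hash itself, so the comparisons stay exact.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a declaration in a module unit (an assumption of this
// sketch, far simpler than a real AST node).
struct ToyDecl {
  std::string declaration; // e.g. "int a()"
  std::string body;        // e.g. "return 43;"
  bool exported;
  bool isInline;
  bool isDependent;
};

// Collect the parts of a unit that count toward the decls hash under the
// rules described above. A real implementation would hash this content.
std::string countedInterface(const std::vector<ToyDecl> &unit) {
  std::string counted;
  for (const ToyDecl &d : unit) {
    // Inline or dependent entities are counted together with their bodies,
    // even when they are not exported (the conservative choice).
    const bool bodyCounts = d.isInline || d.isDependent;
    // Non-exported, non-inline, non-dependent entities are skipped entirely.
    if (!d.exported && !bodyCounts)
      continue;
    counted += d.declaration;
    if (bodyCounts) {
      counted += " { ";
      counted += d.body;
      counted += " }";
    }
    counted += '\n';
  }
  return counted;
}
```

Under this model, the two units with unexported() yield the same content, while the two units that differ only in the body of the inline func() yield different content, matching the behavior described above.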