[RFC] [C++20] [Modules] Introduce Thin BMI and Decls hash

This post is for higher-level discussion of [C++20] [Modules] Introduce thin BMI by ChuanqiXu9 · Pull Request #71622 · llvm/llvm-project · GitHub and [C++20] [Modules] Bring Decls Hash to BMI for C++20 Module units by ChuanqiXu9 · Pull Request #71627 · llvm/llvm-project · GitHub. Both PRs are large, and people may want to make higher-level comments first; it would not be great if the PRs mixed higher-level discussion with low-level review details. Posting here should also give the proposals better visibility.

The post covers both the thin BMI and the decls hash, but the sections can be read independently.

Thin BMI

Let’s call the current BMI for named modules the fat BMI, since it contains all the information needed to compile the BMI itself into an object file. This is the foundation of two-phase compilation.
(There are differing opinions about the terminology in [C++20] [Modules] Introduce thin BMI by ChuanqiXu9 · Pull Request #71622 · llvm/llvm-project · GitHub.)

However, this is a waste of time and space when using one-phase compilation, and notably both GCC and MSVC chose to implement one-phase compilation.

Let’s call a BMI that doesn’t contain all the information needed to produce an object file, but is still able to satisfy the requirements of the module’s consumers, a thin BMI. It is then clear that we should use the thin BMI rather than the fat BMI in one-phase compilation.

The thin BMI may also be helpful in two-phase compilation, if the build system wants to use a compare-and-swap strategy when generating the BMI based on the BMIDeclsHash (the decls hash described below).

Interfaces

The proposed interface is an -fthinBMI-output= flag to specify the location of the thin BMI. It should be used when compiling a single module unit.

The design also makes the thin BMI usable in two-phase compilation. In that mode we’ll generate two BMIs: a thin BMI to be used by consumers, and a fat BMI used to compile the module unit itself into an object file. Having two BMIs for one module unit may sound confusing, but only the thin BMI is the BMI we normally talk about; the fat BMI is visible only to the module unit itself.
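To make the intended usage concrete, here is a sketch of the invocations. The flag spelling comes from the PR; how exactly it composes with -c/-o and with --precompile in two-phase compilation is my assumption rather than something spelled out above.

# One-phase: compile the module unit to an object file and emit a thin BMI for consumers.
clang++ -std=c++20 a.cppm -fthinBMI-output=a.pcm -c -o a.o

# Two-phase: emit the thin BMI for consumers alongside the fat BMI,
# then compile the fat BMI to the object file.
clang++ -std=c++20 a.cppm --precompile -fthinBMI-output=a.pcm -o a.full.pcm
clang++ -std=c++20 a.full.pcm -c -o a.o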

With one-phase compilation, the behavior of -fthinBMI-output= is pretty similar to -fmodule-output=, except that one generates a thin BMI and the other a fat BMI. Keeping them as separate flags is based on two considerations:
(1) The serialization of C++ is pretty complex; we can’t be sure we’re handling every detail correctly from the very beginning.
(2) The fat BMI is already fairly widely used and relatively stable, so it doesn’t seem wise to replace it with the thin BMI immediately.

But, of course, at the end of the day we want consumers to use only the thin BMI. When that day comes, -fmodule-output= will become an alias for -fthinBMI-output=.

An alternative design is to reuse -fmodule-output= and introduce a flag -femit-thin-BMI, so that -femit-thin-BMI -fmodule-output= has the same effect as -fthinBMI-output= does now.
The -femit-thin-BMI flag would be opt-in at first, then become the default (opt-out), and finally be deprecated.
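Under that alternative, the one-phase invocation sketched above would be spelled roughly as follows (again only a sketch of the proposed flags):

clang++ -std=c++20 a.cppm -femit-thin-BMI -fmodule-output=a.pcm -c -o a.o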

Roadmap

The roadmap for thin BMI in my mind is:

(1) In Clang 18, release the thin BMI and mark it as experimental; encourage users and build systems to try the new mode.
(2) In Clang 19 or 20 (depending on issue feedback), remove the experimental mark from the thin BMI and deprecate consuming the fat BMI.
(3) In Clang 21 or 22, error out if users try to import a fat BMI.

Decls Hash

Motivating Example

The motivating example is:

// a.cppm
export module a;
export int a() {
    return 43;
}

// use.cc
import a;
int use() {
    return a();
}

After changing the implementation of a() from return 43; to return 44;, we should be able to avoid recompiling use.cc into use.o, since the interface doesn’t change.

This is a pretty appealing feature.

Interfaces

The interface introduced in this patch is:

  • Every BMI for a C++20 named module will contain a hash value covering all the parts of the decls that may affect consumers.
  • Users (generally build systems) can query this information with the -module-file-info flag or the -get-bmi-decls-hash flag.

For example,

$ clang++ -module-file-info Hello.cppm
Information for module file 'Hello.pcm':
  Module format: raw
  ====== C++20 Module structure ======
  Decls Hash: e414edc5d1e1c721
  Interface Unit 'Hello' is the Primary Module at index #1
   Sub Modules:
    Global Module Fragment '<global>' is at index #2
  ====== ======
  Generated by this Clang: (git@github.com:llvm/llvm-project.git 561b8a1ac2a94761a9bf190c6ad2b8785ce9e072)
  Module name: Hello
  Language options:
    C99: No
    C11: No
    C17: No
    C23: No
    Microsoft Visual C++ full compatibility mode: No
    Kernel mode: No
    Microsoft C++ extensions: No
    Microsoft inline asm blocks: No
    Borland extensions: No
    C++: Yes
    C++11: Yes
    C++14: Yes
    C++17: Yes
    C++20: Yes
    C++23: No
    C++26: No
    ...

or

$ clang++ -get-bmi-decls-hash Hello.pcm
Decls Hash: e414edc5d1e1c721

The difference is that -module-file-info prints more information, such as compiler versions and compilation flags, but may be slower, while -get-bmi-decls-hash focuses on the decls hash only and should be faster.

Note that the hash value doesn’t cover information like compiler versions and compilation flags, just as its name suggests. The thinking is that build systems already know both of those.

Usages

The decls hash allows build systems to use a compare-and-swap strategy when generating a BMI. For example, the build system can generate the BMI to a temporary location and then, provided the compiler version and compilation flags haven’t changed, query and compare the decls hash of the newly generated BMI against that of the existing BMI, replacing the existing BMI only if the hash values differ.

With that, we should be able to make the motivating example work.
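For illustration, here is a rough sketch of that strategy from a build system’s perspective, using the flags proposed above. The file names and shell plumbing are hypothetical; -get-bmi-decls-hash prints a "Decls Hash: ..." line, which is enough for an equality check.

clang++ -std=c++20 a.cppm -fthinBMI-output=a.pcm.tmp -c -o a.o
old=$(clang++ -get-bmi-decls-hash a.pcm 2>/dev/null)   # "Decls Hash: ..." of the existing BMI, if any
new=$(clang++ -get-bmi-decls-hash a.pcm.tmp)           # "Decls Hash: ..." of the freshly built BMI
if [ "$old" != "$new" ]; then
    mv a.pcm.tmp a.pcm   # interface changed: publish the new BMI, consumers rebuild
else
    rm a.pcm.tmp         # interface unchanged: keep the old BMI and its timestamp
fi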

Why not generate byte-identical BMIs directly?

People may feel this process is complicated: why not generate byte-identical BMIs in the first place, so that a simple diff or md5sum would suffice? The answer is that we can’t. We put too many things into the BMI, and we largely serialize the source file as-is. The simplest example is that the generated BMIs differ for:

export module a;

export int a() { return 43; }

and

export module a;
export int a() { return 43; }

just because the function a() has different source locations.

The strategy for computing the decls hash

Currently the strategy is simple: we don’t count the definitions of non-inline, non-dependent functions or variables, and we don’t count non-exported, non-inline, non-dependent entities at all.

For example,

export module a;
export int a() { return 43; }

and

export module a;
int unexported() { return 44; }
export int a() { return unexported(); }

will have the same decls hash value.

More examples can be found in [C++20] [Modules] Bring Decls Hash to BMI for C++20 Module units by ChuanqiXu9 · Pull Request #71627 · llvm/llvm-project · GitHub.

An interesting case is inline functions. The ABI says inline functions will be generated in the units that call them, so we choose to update the hash every time an inline function changes. But this strategy may be too conservative. For example,

export module a;
inline int func() { return 43; }
export int a() { return func(); }

and

export module a;
inline int func() { return 44; }
export int a() { return func(); }

These two units could have the same decls hash, since the inline function func() is not reachable by consumers. But we choose to generate different hashes for the sake of a simpler implementation. Technically it would be possible to handle this by implementing a context-sensitive reachability analysis, but let’s leave that for the future.

The key point here is our strategy for treating such cases. I think:

  • It is a bug if we produce the same decls hash for two different module units that cannot be ABI-compatible for consumers.
  • It is only a missed optimization if we produce different decls hashes for module units that can be ABI-compatible for consumers.

I’m not sure that the build system can skip that step. It would have to recompile because the timestamps changed.

I do think having a hash to know when to invalidate an object and/or BMI is an interesting feature, though. We do something similar in our C++ code generation workflows in our build systems.

Note that Ruoso wrote a paper for the ISO C++ Tooling Study Group (SG-15) a while ago about a similar concept – hashing command lines on their own and then using those hashes to name BMI files. This would allow the build system to use filenames and file timestamps (i.e., how make and ninja work) to invalidate and reuse BMIs as needed. And, importantly, it would make it possible for clang to disregard flags that would not affect a thin BMI, like diagnostic output flags. Right now, build systems either have to be pessimistic and build many redundant BMIs just in case, or they have to try to hardcode details about all the flags for all the compilers they support.

I think he means the build system can look at the hash and not replace the BMI (.pcm) file if the hash didn’t change. So the timestamp on the BMI wouldn’t change.

I have my own build system and also wouldn’t mind directly looking at the hash and using that instead of a file timestamp when considering .pcm files.


I look forward to trying this out, and I will do so as soon as it is merged into the main branch. I’ve been trying to use C++20 modules, and the needless recompilation issue has been a problem. This feature looks like it will solve it.

I’m not sure the declhash path is viable (how do other implementations handle these cases?), because the diagnostic experience is based on source locations - it’s meaningfully different to report an error on line 2 compared to reporting it on line 3. So it’d be important to rerun builds for warnings, errors, DWARF, etc., based on changes in source locations, I think?


Indeed, unless we’re literally just going to boil it down to “function F in module M has a problem… go find it”, just about any byte change can be a candidate for these things. I am of the opinion that the BMI can be remade unconditionally and caching tools save you the downstream work (basically, short-circuit the compiler across builds, not just within a single build while trying to dance around mtime shenanigans).

There is probably a better selection of names here (“thick” and “thin” were used previously in discussions about BMIs, referring to dependent BMIs and their being loaded automatically). The names also aren’t indicative of the relative usages possible.

Suggested on Github:

  • “interface BMI” vs. “implementation BMI” (confusing when expanded)
  • “BMInterface” vs. “BMImplementation”

Two that come to my mind include “skeletal BMI” vs. “full BMI” or “minimal BMI” vs. “precompiled BMI”.

Another one that came up on github was “Surface BMI” for the thin case; it’s not my suggestion but I think the term is apt. You can imagine a 3D object (the full BMI) and a picture[1] or blueprint of it – both represent the same thing, but one is unnecessarily heavy and only necessary for the final construction. All other users would be fine with only the picture/blueprint.

I think it would be helpful if these did not share the same acronym. If we go back to the original definition of BMI (Built Module Interface), then a couple of the suggestions become less useful IMO, as an “Interface” is something that’s hard to reasonably qualify with thin/fat etc.

Perhaps it would make sense to distinguish BMI (fat; as before) from BMS (Built Module Surface; thin). Anyway, just some musings on bikeshed colours.


  1. Looking at it just from one angle/side, or similarly, just its surface rather than its interior.

I think he means the build system can look at the hash and not replace the BMI (.pcm) file if the hash didn’t change. So the timestamp on the BMI wouldn’t change.

Yes.

I look forward to trying this out, and I will do so as soon as it is merged into the main branch. I’ve been trying to use C++20 modules, and the needless recompilation issue has been a problem. This feature looks like it will solve it.

If you’re interested, I think you can try it out right now based on the PR’s branch. I’ve already verified it works for simple Hello World examples.

Oh, nice catch! The DWARF information is the killer here, since it is part of the final product; that makes it different from warnings and errors. If the source-location information in the DWARF is incorrect, it means we can’t set breakpoints correctly based on the sources in the real filesystem.


(What follows are some thoughts about warnings and errors.)

It is indeed a problem for warnings and errors, and I just tried it out. (I hadn’t tried invalid examples before.)

For

// Hello.cppm
export module Hello;
export void hello() {
    
}

// Use.cpp
import Hello;
int use() {
    hell(); // misspelled
}

The diagnostic message is:

Use.cpp:3:5: error: use of undeclared identifier 'hell'; did you mean 'hello'?
    3 |     hell();
      |     ^~~~
      |     hello
/home/chuanqi.xcq/llvm-project-for-work/build/HelloWorld/Hello.cppm:2:13: note: 'hello' declared here
    2 | export void hello() {
      |             ^
1 error generated.

Then if we change the source of Hello.cppm to

export module Hello;

void a() {}

export void hello() {
    a();
}

then the error message becomes:

Use.cpp:3:5: error: use of undeclared identifier 'hell'; did you mean 'hello'?
    3 |     hell();
      |     ^~~~
      |     hello
/home/chuanqi.xcq/llvm-project-for-work/build/HelloWorld/Hello.cppm:3:12: note: 'hello' declared here
    3 | void a() {}
      |            ^
1 error generated.

Yeah, the note simply points at the wrong place. But I still want to look for possible solutions (or workarounds) for this, since it is a really appealing feature.

The possible solutions in my mind are:
(1) Just ignore the difference. Programmers can get the correct locations by clearing the corresponding BMIs in their build directory.
(2) Always generate the BMI (instead of compare-and-swap), but record the decls hash values of the required modules in the consumers. Then we can avoid recompiling a consumer if its sources don’t change and none of the decls hash values of its required modules change. This is closer to the kind of thing we want to do in ccache.

For example, suppose we’ve compiled use.cc to use.o and recorded that the hash values of the modules required by use.cc are {0x123456}. Then we change Hello.cppm and generate a new Hello.pcm. Before compiling use.cc again, we find that the source of use.cc hasn’t changed and the hash values of the required modules are the same, so we can skip the compilation of use.cc.
(3) Embed the source files into the BMI. Then no matter how the source location of hello() changes, the diagnostic message will always point to the declaration of hello(). While the printed line numbers may still differ from the ones in the real filesystem, the diagnostic messages would be much clearer than they are now.

What do you think about this issue?

On the naming discussion, from the position of a non-native speaker, I find the pair (“interface BMI” vs. “implementation BMI”) the easiest to understand.

We’re talking about C++ here, so we should clearly prioritize performance over correctness :)

With regard to (1), we should avoid adding yet more cases that motivate developers to try make clean as a solution to their problems.

I like (2) in principle, though I worry that extracting and comparing the hashes will impose additional overhead in what I expect to be the common case: where the hash differs because a change to the interface was made. That might be an unfounded concern though.

Option (3) seems to be a band-aid that works around one particular manifestation of the root problem, but I imagine that there are others.

Can we perhaps just brand these differently? Perhaps the thin/interface BMI becomes the BMI and the fat/implementation BMI just becomes a TU/AST?

With regard to (1), we should avoid adding yet more cases that motivate developers to try make clean as a solution to their problems.

make clean is bad. I meant simply removing the BMI: either removing it by hand (rm <path/to/Hello.pcm>) or requiring the build system to offer a command like cmake ... --remove-bmi-for a.cppm.

Can we perhaps just brand these differently? Perhaps the thin/interface BMI becomes the BMI and the fat/implementation BMI just becomes a TU/AST?

Ideally, that will be the case at the end of the day. But for the sake of stability, we won’t replace the fat BMI with the thin BMI immediately. According to my roadmap, it may take two years before the fat BMI is out of the picture, so we need names for both before then.

#2 sounds like the best option to me, unless there is a way to generate the debug or error-reporting info to the side, so that it can be new without the interface .pcm file being new. I don’t know much about how this works and am just looking at this as a “user” of C++.

I wonder how GCC and MS do it?

(The following is probably not realistic for Clang. I’m just wondering.) … I’ve thought before about a compiler that would abandon file-based granularity and instead work on a SQLite database of compiled objects. The entire process would pull through a fine-grained DAG of dependencies, at the level of individual names, functions, types, etc. It would be a different compiler architecture I guess, but I wonder if it would allow significantly faster builds in daily work.

I don’t think this direction (semantic hash to reduce rebuilds) is viable - we need deterministic results that don’t differ based on the history of incremental rebuilds.

The best we can do, I think, is to ensure as little information as possible is in the BMI in the first place, and that it’s canonicalized as much as possible. That still means, yes, if you add a new comment line, shifting source lines down, everything does have to rebuild.

One way you could go further would be to separate out some of this information, and only use it later - like if some of the line number information went into a separate file and was only consumed to render diagnostics (like the compiler fails, produces some json file or something - that the build system then needs to feed in the source line information to translate the json into diagnostics) - then the line information isn’t an input to the compilation. But how do you do a -g build? It’d be unfortunate if the inputs to a build action depended on flags like that…

So, I don’t really see a viable direction here.

That still requires a developer to know that their issue is caused by a stale BMI. The developer would have to have a good understanding of how support for modules is implemented and how their build system works to diagnose that the problem is in any way related to BMIs.

Since the “I” in BMI stands for “interface”, I think “interface BMI” and “implementation BMI” don’t make a lot of sense. How about “BMI” for the minimal variant (since that matches what is desired long term; “minimal” is the implicit default) with an additional “maximal/complete/verbose/full” qualifier for the variant that includes the full AST?

My understanding is that IBM Visual Age version 4 had something like that. It failed in the market place because of the lack of integration with existing technologies and products (e.g., build systems and version control systems). I have no direct experience with it though.

If I understand option (2) correctly, the result is still deterministic; the hash is just used to avoid building objects with (semantically) the same inputs as they were last built with. But perhaps you have another perspective on what is or would not be deterministic.

What are the other compilers doing, like MS or GCC? Do you have to create thing.cppm and thing_impl.cppm to cut down on rebuilds? Or can one write only thing.cppm and somehow (interface hashing, or ??) the compiler and build system take care of it?

I thought modules were going to allow just thing.cppm, which I was looking forward to. I know it’s not the #1 module feature (which is ending the horrors of textual #include).

I also like that a potential hashing design could be better than headers at isolating the interface, because C++ forces you to put non-interface things, like private member functions, in a header file. But a hashing scheme could leave those out.

Sounds good to me : )

(The following is probably not realistic for Clang. I’m just wondering.) … I’ve thought before about a compiler that would abandon file-based granularity and instead work on a SQLite database of compiled objects. The entire process would pull through a fine-grained DAG of dependencies, at the level of individual names, functions, types, etc. It would be a different compiler architecture I guess, but I wonder if it would allow significantly faster builds in daily work.

This is an interesting idea. But it requires us to design a language to communicate between build systems and the compilers. Maybe it is possible but I don’t know how.

The hash value is used to identify that the inputs are “semantically” equivalent, but what @dblaikie is concerned about is source locations. While option (2) can solve the incorrect source locations in warnings/errors, IIUC it can’t solve the incorrect source locations in the debug information for inline functions (since inline functions need to be generated in the caller’s unit).

@dblaikie I am curious about the behavior of DWARF for inline functions. For example,

// foo.h
// ...
inline int func() { ... }

// bar.h
// ...
inline int func() { ... } // same implementation as the one in foo.h

// a.cpp
#include "foo.h"
...
func();

// b.cpp
#include "bar.h"
...
func();

The program should be valid AFAIK. But how will the debugger handle func(), given that func() has different source locations in different files?

Also, as I recall, there are cases where gdb can’t print information, e.g., when things get optimized out. If the debug information isn’t required to describe everything, does it make sense not to generate debug information for inline functions coming from modules?

What are the other compilers doing, like MS or GCC?

I haven’t heard of them doing similar things. I’d like to ask them when I have the chance.

This seems relevant. It’s about the “private module fragment” that can exist in a primary interface unit: [module.private.frag].

Consumers of that interface would certainly want to depend on something (a hash, or thin BMI, or whatever) that doesn’t change when the private code changes.
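As a concrete illustration (my own sketch, not something taken from the standard text or the PRs), a change inside the private module fragment below ideally should not invalidate consumers of module m:

// m.cppm
export module m;
export int api();              // the only part importers can see

module :private;               // [module.private.frag]
int helper() { return 43; }    // changing this body should not disturb importers
int api() { return helper(); }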

Yep, anything where an incremental build produces a different result from a full rebuild is a bug - and this feature, at least as currently designed, would necessitate such behavior and would not be acceptable.

I think “BMI” for the minimal variant is still going to be a source of confusion since we’ve been using the term to refer to the non-minimal version for a while now & we can’t really assume every author and reader is on the same page about this transition in terminology.

& given that history, it seems like “interface BMI” and “implementation BMI” still make sense/accurately reflect the difference between these two things in a way that’s more clear than “full/complete/maximal/minimal/thin/slim/etc”?

The issue is that semantic equivalence isn’t quite enough - line numbers are part of the behavior of the compiler (for diagnostics and for debug info, most notably). So a clean build and an incremental build of the same source (where the incremental build was built against a semantically equivalent but syntactically different version - an extra blank line, etc.) can produce different output.

It’s deterministic in the sense that “given all the inputs, including the previous compilation’s output, the hash, the previous module definition, and the current module definition”, the output is always the same. But given only the same source code, the output is not deterministic - it depends on the previous state of the build (which you built at a previous, semantically equivalent but syntactically nonequivalent revision). An incremental build that isn’t identical to a full rebuild is generally a bad thing - not something we would accept.