[C++20] [Modules] Should the BMIs contain paths to their dependent BMIs?

This is discussed in [C++20] [Modules] All BMIs are irreproducible · Issue #62269 · llvm/llvm-project · GitHub. But that thread is already long and this question is not directly related to that issue. Also, I am not sure whether this is really a problem. So I’ll try to state and discuss it here.

Examples

For the tiny example,

// b.cppm
export module b;
export int b() {
    return 43;
}

// a.cppm
export module a;
import b;
export int a() {
    return b() + 43;
}

// user.cpp
import a;
int use() {
    return a();
}

We can compile it by:

clang++ -std=c++20 b.cppm --precompile -o b.pcm
clang++ -std=c++20 a.cppm --precompile -fmodule-file=b=b.pcm -o a.pcm
clang++ -std=c++20 user.cpp -fmodule-file=a=a.pcm -c -o user.o

In the last command we only specify the location of a.pcm; the location of b.pcm is not specified. The command still works because the location of b.pcm is recorded inside a.pcm. This is what we call an implicit dependency. We generally prefer explicit things to implicit things, because implicit things can sometimes give surprising results. (I guess everyone here has had some similar experience.)

(IIUC) @dblaikie proposes to forbid these implicit dependencies for BMIs and force us to write something like:

clang++ -std=c++20 user.cpp -fmodule-file=a=a.pcm -fmodule-file=b=b.pcm -c -o user.o

I didn’t quite understand this at first, because it should be fine as long as build systems manage everything well. But then I related it to implicit conversions: implicit conversions are fine if everyone handles them well, yet I suspect the people who hate implicit conversions outnumber the people who love them.

For build systems, I can imagine the implicit dependencies causing problems in situations like caching and distributed builds. But I don’t have deep insight into build systems, so I’d like to raise the discussion here.

CC: @dblaikie @aaronmondal @ben.boeckel @ruoso

MSVC already requires specifying the transitive closure of modules that need importing (there are no cross-.ifc references at the .ifc level). So if Clang wants to do this too, I don’t see an issue from the CMake side at least, as we already need to do (and have implemented) it for MSVC support.
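For illustration, the MSVC equivalent of the explicit form looks roughly like this (the /reference name=path spelling is from the MSVC documentation; the .ifc file names are made up for this example):

cl /std:c++20 /c user.cpp /reference a=a.ifc /reference b=b.ifc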

The bigger problem comes when using precompiled BMIs from an install tree: how do we strip the build locations from them? Reproducible and relocatable builds would also love to strip such things.

I believe the tradeoffs are like this:

  1. -fmodule-file=a=a.pcm -fmodule-file=b=b.pcm is “simple”, but could absolutely explode command lines if libraries aren’t laid out well. We’d potentially end up with “explicit” command lines as long as the output of “-H” invocations.

  2. -fmodule-file=a=a.pcm doesn’t have this issue, but in that case IMO a.pcm has to contain b.pcm. If a.pcm implicitly depends on the location of b.pcm in the filesystem, the build system can’t distribute builds well. Since we need b.pcm’s contents somewhere for a.pcm to be usable, one might initially think to just embed it in a.pcm. This would shorten command lines and make BMIs easier to move and handle in build systems. However, as we build more and more targets, this would explode our BMI sizes and we’d have a lot of duplicated information in these files. That would probably destroy remote caching setups.

So my proposal would be this:

Differentiate between primary module interfaces and module partitions.

  • Always embed partitions like a:partition in the primary BMI for a. This way the build system only needs to know the location when building a but not after that.
  • Don’t embed primary modules like b in a, but also don’t explicitly depend on the path of b in any way; only depend on the name b. The compiler flag -fmodule-file=b=path/to/b.pcm already lets the build system handle where b comes from. (See the build-flow sketch after this list.)
  • I’m not sure how to handle includes in BMIs, but they need to be relative in any case; otherwise sandboxing and distribution will break. Relative paths to the resolved headers (-Iinclude leading to include/myheader.h) seem most intuitive to me.
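As a sketch of the build flow this would give us (the file and partition names are made up; with today’s Clang the partition’s BMI path is instead recorded implicitly in a.pcm):

clang++ -std=c++20 a-part.cppm --precompile -o a-part.pcm          # defines module a:part
clang++ -std=c++20 a.cppm --precompile -fmodule-file=a:part=a-part.pcm -o a.pcm
# Under this proposal a.pcm would embed a:part, so consumers only need the primary BMI:
clang++ -std=c++20 user.cpp -fmodule-file=a=a.pcm -c -o user.o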

This way we’d have reasonably short command lines that only contain primary BMIs, low binary duplication in caches, and the ability to distribute (well, at least move) BMIs. BMIs aren’t portable at the moment for many reasons, but at least this setup would remove the blockers on the build-system/filesystem side for BMI portability (at least those that I can think of).

It is an interesting idea to embed some BMIs into other BMIs. But I am hesitant about this since the size of BMIs is already a concern for many people.

-fmodule-file=a=a.pcm -fmodule-file=b=b.pcm is “simple”, but could absolutely explode command lines if libraries aren’t laid out well.

Is there a known problem when the command line gets too long?

Which approach is easier to change in the future? I’m mostly interested in the actual usage patterns and how our decision would shape them. For example, if the StdLib puts implementation details in separate modules, it is better to import them transitively (roughly how headers are included right now). As another example, if a library A is careful about putting all implementation details into partitions, there is some value in making it the responsibility of A’s clients to depend on B explicitly.

In distributed build systems like Bazel and Buck, you have to explicitly specify all dependencies, and there may not be any paths stored in the BMIs or object files. I am in favour of explicitly giving all dependencies; implicit dependencies might cause you trouble later on.

Thanks to everyone involved. It looks like the consensus is no. Then let’s track the issue in [C++20] [Modules] Don't generate paths implicitly in BMI to their dependent BMI · Issue #62707 · llvm/llvm-project · GitHub.

The ROCm build files mentioned some issues with Windows builds due to Windows not handling long command lines well, but that might have been some other edge case which is not relevant here.

I also agree on no. I think that approach is the most flexible and we can always think about embeddings and other efficiency improvements at a later stage.

There are response files: you put the command-line arguments into a file and tell the compiler that the command line is in this file. This mechanism is often used when you run into limits on the length of the command line.
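For example, using Clang’s @file response-file syntax (the file name modules.rsp is made up):

# modules.rsp contains the long list of module arguments, one per line:
#   -fmodule-file=a=a.pcm
#   -fmodule-file=b=b.pcm
clang++ -std=c++20 user.cpp @modules.rsp -c -o user.o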

In this example in particular, user.cpp is using a declaration of int a(), and that declaration doesn’t actually reach into module b. If the entities were pruned in the BMI, you wouldn’t have a dangling reference.

I think once we have the pruning of the BMI to include only the things reachable from exports, including the dependent entities into it will likely be a more reasonable thing to do.

But the definition of b() is required by the definition of a(). So when generating code, I’d expect the compiler to be able to see the body of b() to perform optimizations. Otherwise, there will be a performance regression.

Is an optimization that depends on the definition really allowed there?

My understanding is that the definition in that case is still private to the module interface unit. IOW, a definition in the module interface is not inline, and the implementation is allowed to change without that being a breaking change.

Is an optimization that depends on the definition really allowed there?

Optimizations that depend on function bodies are called IPO (interprocedural optimization; you can think of it simply as inlining). IPO is super important for the compiler’s optimizations. Performance would go down significantly if IPO were disallowed across different module units.
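To make this concrete with the example from the top of the thread: if the optimizer can see the body of b(), it can inline the call and fold a() down to a constant. A sketch of the optimized result (not literal compiler output):

// with cross-module IPO, a() from the opening example can be folded to:
export int a() {
    return 86;   // b() inlined: 43 + 43
}
// without cross-module IPO, every call to a() pays for an opaque call to b()
// and no such folding is possible.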

Performance is a key lifeline for C++. I can’t imagine how many people would still want to use modules after learning that this introduces a significant performance regression.

My understanding is that the definition in that case is still private to the module interface unit.

According to the standard, the definition here is invisible but reachable. It is different from the definitions in the private module fragment.
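A minimal sketch of that distinction as I read the standard (the identifiers are made up):

export module m;
int detail() { return 1; }             // not exported: invisible to importers,
                                       // but it lives in the interface unit, so it is reachable
export int api() { return detail(); }
module :private;
int hidden() { return 2; }             // private module fragment: not reachable from other TUs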

IOW, a definition in the module interface is not inline, and the implementation is allowed to change without that being a breaking change.

To achieve this, we must disable IPO across module units, which is not acceptable from the optimizer’s perspective.

It seems to me that “invisible” implies the optimizer shouldn’t be able to see it.

This looks like a really important point to be clarified in the specification.

Then what is the position of “reachable”? For me, visibility is the mechanism that controls accessibility at the language level, just like public, protected and private in OOP; they reflect software engineering ideas. But if such things became a blocker for the optimizer, I feel it would break the zero-overhead rule.

My understanding is that the driver for how modules were designed was to simplify the writing of the code.

If a definition in a module interface unit is semantically part of the interface, this needs to be called out explicitly, since that has a significant impact on how module interface units have to be written.

Again, it’s not clear to me that we have a consensus on that. I’m not opposed to it, but if that’s the case we have to make it explicit.

Yeah, explicitness is very important.


BTW, after rethinking the issue, I think it is nearly impossible to implement the guarantee that touching only non-exported, non-inline entities doesn’t break the build. See the example:

// a.cppm
export module a;
int b() { ... }
export inline int a() { return b(); }
int local_use() { return a(); }

// use.cpp
import a;
int use() { return a(); }

Let’s say we changed the definition of b() after a build and we didn’t re-compile use.cpp; we just linked the changed a.cppm.o with the unchanged use.cpp.o. Note that we assume inlining across module units is disabled but inlining within the same module unit is enabled.

Then the function body of a() in a.cppm.o may differ from the function body of a() in use.cpp.o due to the inlining. This breaks the semantics of inline: within a program, all definitions of the same inline function in different TUs must be identical.

So it is not enough to disable inlining across module units; we would need to disable inlining completely to achieve that goal, which is even worse.

(Tiny note: inlining refers to compiler optimizations here while inline refers to the language feature.)
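To make the failure mode concrete, suppose the body of b() was return 0; in the original build and we change it to return 1; (both bodies are made up for illustration):

// a.cppm, after the edit
export module a;
int b() { return 1; }                  // was: return 0;
export inline int a() { return b(); }  // a.cppm.o now contains a() with "return 1" inlined
int local_use() { return a(); }

The stale use.cpp.o still contains a copy of a() compiled with return 0 inlined, so the two object files carry different definitions of the inline function a(), and the linker has no way to detect the mismatch.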


since that has a significant impact on how module interface units have to be written.

I don’t think this has an impact on writing module interface units. As programmers, we should only care about whether or not a declaration should be exported, just like we add public, private and protected, which is basically an OOP concern.

The difference is if the optimizer is allowed to make assumptions that would be invalid if the definition changes.

IOW, is the definition in a module interface unit an ODR-relevant element? Or is it valid to have different translation units with different definitions?

My understanding was that the definition in a module interface unit was “invisible”, and therefore it was not a concern for ODR.

I can see the use case for making it ODR-relevant, but I don’t think the specification makes that case right now.

Again, I think that’s an interesting approach, but I don’t think this is the “letter of the law” right now. It may be worth making it so.

Yeah, it is always good to make things clearer.

The difference is if the optimizer is allowed to make assumptions that would be invalid if the definition changes.

BTW, that would add extra work for the optimizer to understand caches and dependencies. Although it may not be impossible, it would be hard.

I’ll be very surprised if the second ‘if’ is true.