Generating PCM (module interfaces) and regular object files from the same compiler invocation

In many (possibly most) cases the compilation of a C++20 module source will produce two artefacts:

  • a BMI (binary module interface) and
  • a regular compiled object file (implementing any required functionality for (1).

At present, we carry out two invocations of the front end to accomplish this, first we build the BMI and then we consume that in a second invocation to generate the object.

It would be more efficient if we were able to produce both from a single invocation, and at some future time that would also allow for an optimisation of the BMI content (currently it needs to contain everything necessary to produce the object, where only a sub-set of that is required to carry the module interfaces).

In support of this I have drafted some patches to implement the first stages of this but during discussion @ChuanqiXu has pointed out something that is to me highly surprising and I’d like folks to help understand it.

Given
clang++ -std=c++20 foo.cpp -c -o foo.o

and
clang++ -std=c++20 -xc++-module foo.cpp --precompile -o foo.pcm
clang++ -std=c++20 foo.pcm -c -o foo.o

I would expect the two objects to be identical (since the source is treated the same way by the FE; the module tag is only needed in the driver), however @ChuanqiXu has observed that there are differences in the handling of ODR checking.

So:

  1. Is it an original design intent that this difference is seen? (if so, some pointer to the design would be helpful)
  2. Is it “design creep”? - i.e somehow patches have added this constraint but unintended
  3. Is it a bug?

A sketch of how my intended change works is that it inserts a multiplex AST consumer that feeds the AST (as existing) to the backend and a deferred PCH writer.

It is my expectation that the AST from the Front End is complete and has relevant errors already raised. It would be very surprising to me that the act of serialising and then deserialising that AST would produce additional error checks/diagnostics (other than failures in the serialization process, of course).

Any help here will be most welcome.

Perhaps Chuanqi was referring to issues like [Modules] Incorrect ODR checks caused by `preferred_name` attribute. · Issue #56490 · llvm/llvm-project · GitHub where deserialization fails to produce an AST that matches the one that was originally serialized thereby leading to ODR check failures?

Sounds like a bug to me, besides the one pointed by @tahonermann I’ve seen many objc bugs where serialization+deserialization yield different behavior, not surprised you’ve found c++ ones.

Now I agree the behavior (different handling of ODR checking by different invocations) is a bug.


(This section is about one of the reproducer and my thought history. It may be redundant for some readers)

There was a reproducer in clang13 and I fixed it later. But I think it may be a good example.

// foo.h
template <class Base>
struct bar {
  using Ty = int;
};
template <class T>
struct foo : bar<T> {
  using typename bar<T>::Ty;
  void baz(Ty);
};
//--- X.cppm
module;
#include "foo.h"
export module X;

//--- Y.cppm
module;
#include "foo.h"
export module Y;
import X;

In clang13, when we compile Y.cppm, it will emit an odr-checking error. But if we rename Y.cppm to Y.cpp, no error message shows. This case is a false-positive example. Now if we compile Y.cppm in clang15, there is no error. But I think this is sufficient to describe the phenomenon that we get different behavior with different input kinds. (I remember I fixed the true-positive example but I didn’t take time to find one. Since I think this one may be enough.)

The reason is that we’ll do some odr checks in the deserialization part. And if we miss the odr-checking part, it won’t be surprise to have different behaviors. Why I didn’t think this is a bug before? Since I thought the design of clang is to declare modules in *.cppm files. And so we shouldn’t declare modules in *.cpp files. So it will be invalid to declare modules in *.cpp. Then the different behavior in two style won’t be a problem since there will only be one style.

However, @iains gave a compelling example in a private chat: (I thought you should post this example in this thread)

Imaging there is a `A.cpp` file and it is not a module unit.
And now we compile it and serialize it and deserialize and compiles it.
Would it make sense the ODR checking result is different with compiling it directly? (Let's focus on ODR checking now.)

Since A.cpp is not a module so my previous theory won’t stand. And now if the odr-checking behavior is different, there are only 2 possible reasons:
(1) The ODR checking in deserialization is an overkill. This is the case of the above example.
(2) The Sema lack some ODR checks. This is what I call true positive cases.

So I have to agree such inconsistency is a bug now.


Iain said it is odd that the deserilization will do odr checks. He wish to do all the odr checks in the sema part. And my thought is: if we are designing some brand new compiler, I’ll try to do the odr checks in the sema part too. Since it is about the semantics. But now for the current states of clang, I feel it is unrealistic to do such a big refactor. There’ll be a lot of changes and so many the risks. And the benefits are not so high to me. This is my feelings.

I guess someone may ask: if it is a bug, why don’t fix it? I mean if the deserilaization took an overkill odr check or the sema lack some checks, it is a bug then we should fix it in the deserialization or sema part. But the higher level design (checking odr-mismatch in deserialization) is not a bug. I prefer to call it a defect (or pity?).


@iains so now for your patches, the only high level problem (or question?) is whether or not we’ll take a special suffix to the module units. Do you wish to talk it somewhere else or in this thread? I feel this is another (although related) topic. So I feel it may be better to talk it to somewhere else.

Yeah, this is related although not directly related.

OK. SO I am encouraged that the replies at present agree that this looks wrong.

Let us try to keep this thread on one topic (it are related issues, like the filenaming mechanisms the driver uses to determine job orders, but I’d prefer to shift that discussion to a new thread).

It is true that I have a view that the deserialiser should not be doing ODR checks (that seems like a layering violation to me, and it could make fixing bugs much harder). However - let us also keep that issue separate (unless it blocks fixing the underlying bug).

@ChuanqiXu has reminded me to highlight this:

Suppose foo.cpp does not contain a module interface (so it would be incorrect to process it as a module, in some sense) but it does import at least one module.

We should not expect to require serializing/deserializing this compilation (any more than we would expect to serializing/deserializing any regular C++). If we are now missing ODR analysis, then that does force us to consider moving those checks to Sema.

I guess we could use an example that fails for LLVM-15/main.

Suppose foo.cpp does not contain a module interface (so it would be incorrect to process it as a module, in some sense) but it does import at least one module.

We should not expect to require serializing/deserializing this compilation (any more than we would expect to serializing/deserializing any regular C++). If we are now missing ODR analysis, then that does force us to consider moving those checks to Sema.

I just want to say what we do now is to add instead of moving. It doesn’t make sense to remove the odr checks in deserilization simply (I mean the engineering side).

Yes, I could see that (at least initially) we could add the missing checks to Sema.

We have discussed two bugs:

  1. round-tripping through the serialiser does not produce identical AST. That seems like a serious bug for modules (but has nothing directly to do with what we’re discussing).

  2. That the AST we serialize has missed some checks which are only applied when we deserialize it. That would block the idea being discussed here, since to implement this correctly we would require that the AST was complete, self contained and checked at the point we pass it to the back-end (so that we can also stream a copy).

Next, it seems we should look for reproducers for these two issues (I guess there are already PRs for (1)?)

Basically agreed. But I haven’t found existing PRs. I found that reproducer before since I didn’t know the module unit should end with ‘.cppm’ at the start. So then I never found the bug again.

And my feeling is (not proved) the Sema checks are relatively strong due to historical reasons (They came much earlier than we introduce modules). I remember the problem I found are mainly about concepts (new introduced) and template parameters. (And all the bugs I remembered should have been fixed).

Next, it seems we should look for reproducers for these two issues

I agree this is the correct direction. But TBH, it is relatively hard since: we have too many TODOs and it is hard to tell when to stop for such exploring. I think we can do some simple lookups in the beginning. And if we don’t find, then let’s go on and turn back again if we met the problem again.