[RFC] [Modules] Should we embed sources to the BMI?

ChuanqiXu · September 5, 2024, 2:41am

I think that if GCC isn’t going to do it, build systems are going to have to move the code anyway,

It sounds like GCC is the golden rule, which is more or less odd. The number of build systems are much more than the number of compilers. I think it is arbitrary to say they won’t support it. Actually, we’ve support it in our downstream build systems. It is pretty beneficial for the sandboxed build. We’d like to contribute it if the preconditions are met. And I believe it will be helpful for distributed build too.

I guess I only see this being the case for custom build systems, because if GCC or MSVC doesn’t do this too, no one is going ot use this feature.

I can’t agree with this. I know some projects are using clang only. I believe such projects won’t be too rare.

so this ends up being a rarely/never used functionality that we have to test and maintain.

In another perspective, if this is really a rarely or never used functionality, we should remove the functionality directly. But it is not the case. We’ve already used it. And I think Google are using it too. Is it the case?@dblaikie @ilya @zygoloid

Which github issue do you mean?

It is listed in the first paragraph of the post.

That said, I would probably wish that we spend time fixing THAT requirement, or finding a valid fallback in those situations for when the file doesn’t exist.

Technically agreed. But given it is too fundamental, I don’t know how to do it cleanly. As I said in the post, if we choose this way forward, I do think it is the same choice to stuck us in this direction.

More concretely, some issues are in ASTReader. I’ve fixed some of them. And some are in the source manager, (I found this in downstream):

github.com

llvm/llvm-project/blob/c28b1a19aadff97b369889aee084073a181cfda8/clang/lib/Basic/SourceManager.cpp#L125


      
            return std::nullopt;
          if (Buffer)
            return Buffer->getMemBufferRef();
          if (!ContentsEntry)
            return std::nullopt;
          
          // Start with the assumption that the buffer is invalid to simplify early
          // return paths.
          IsBufferInvalid = true;
          
          auto BufferOrError = FM.getBufferForFile(*ContentsEntry, IsFileVolatile);
          
          // If we were unable to open the file, then we are in an inconsistent
          // situation where the content cache referenced a file which no longer
          // exists. Most likely, we were using a stat cache with an invalid entry but
          // the file could also have been removed during processing. Since we can't
          // really deal with this situation, just create an empty buffer.
          if (!BufferOrError) {
            if (Diag.isDiagnosticInFlight())
              Diag.SetDelayedDiagnostic(diag::err_cannot_open_file,
                                        ContentsEntry->getName(),

I don’t want to touch source manager unless we have to:

This seems actively harmful FWIW. “requiring” the source files to be present(either by flag or existence) without a good mental model /explanation for devs seems problematic.

It is not the case in my experience. All the users I know (including either build system authors and end users), they either accept that all the input files need to be present or require ways to workaround it (the topic we’re discussing.)

CC some build system authors: @boris-kolpackov @ben.boeckel @Arthapz @aaronmondal

zygoloid · September 5, 2024, 4:55am

Yes, we use this option heavily in our distributed builds. I added the option to fix scalabilty issues we saw due to staging large numbers of header files to our build machines, and it was an effective fix for that issue. (Unlike in non-modules builds, we can’t even prune the set of staged headers to just the ones transitively reached by #includes without this option, and with it we can prune to a small subset of that set. It makes a very large difference to the build action setup cost.)

ben.boeckel · September 5, 2024, 3:52pm

Note that another concern (that, AFAIK, doesn’t affect monorepo-like builds of Google and such) is that there may be many BMIs for a single module due to different flags of importers requiring separate BMIs from other importers. Any overhead of embedding the source is not “just” a one-time thing: it is exacerbated by the number of BMI instances that happen to be necessary.

erichkeane · September 5, 2024, 3:58pm

Thats not really what I was saying. I was saying that if there isn’t feature parity between the compilers, than a build system (basically any mainstream one) that needs to support both, would have to do the file copying anyway.

That said, I hear from you and @zygoloid that this is independently useful ‘enough’.

“No one” meant build systems, not just individuals here.

From what you’re saying, we probably “have to” unfortunately.

I think you’re missing the point. The problem is if it isn’t ALWAYS required, or NEVER required without some good mental rules as to when it would and when it wouldn’t (like say, I used some feature of the language that causes the source-manager issues above, do I now just have to ‘know’ to stay away from that feature?) is going to be actively harmful.

ChuanqiXu · September 6, 2024, 1:48am

But the ratio may not change IIIUC.

ChuanqiXu · September 6, 2024, 1:53am

Maybe the term “required” is misleading. My mental model is, (at least before we fixed the fundamental issue), we need to “assume” all the input files are present when consuming the BMI. (from users perspective)

Or are we you saying we need explicit rules to reproduce it, so that we can know if we fixed issue? (the developers perspective.) If this is the point, then it is true that this is not good. But I don’t know how to improve this. Adding checks during reading the BMI won’t help with the fundamental problem.

erichkeane · September 6, 2024, 1:32pm

What I’m saying is if SOMETIMES I can compile without sources being available, but sometimes not, and for no really good reason on something I (the programmer) did, then that is bad.

SO what would make me feel fine about this is a diagnostic that says “source file X.cpp not available, copy it to <where I’m looking> or use this magic flag” to cover the rest of the cases.

ChuanqiXu · September 9, 2024, 1:54am

While I think the mind model to assume all files need to be present is fine, I think the suggestion of the diagnostic looks good. Will do after I land that patch to offer that option.

mizvekov · September 9, 2024, 10:54pm

Given that clang has a much richer AST than GCC does, with ast-print support in particular, I wonder how much degradation in diagnostics we would have in practice, and how does that compare to GCC.

In any case, I think improving the AST and ast-print is preferable, rather than embedding the source code.

ChuanqiXu · September 10, 2024, 2:09am

I didn’t get your point. How this is related to degradation in diagnostics and ast-print?

mizvekov · September 10, 2024, 2:28am

In clang we are able to go back from AST to source code-like text representation, and from that process we can recover most of the source code as-written from the AST.

That’s supposedly one big advantage clang has over GCC.

Why are we not leveraging that advantage in order to not depend on the source code after we have produced the PCH? Where do we need to improve in that area?

If we can do this source code recovery better than GCC, then we should see less diagnostics degradation than GCC when the source code is not available, at least if we are using this ast-print properly where it needs to be used.

ChuanqiXu · September 10, 2024, 3:49am

Are you saying: since we can print source code from the AST, so we can print the AST directly when we failed to get the contents of some file?

Interesting idea. But there are some problems I can see:
(1) As far as I know, the ability to print source code from AST can’t remain all the information like comments and indentation and spaces, then it might be confusing.
(2) In my reply, I said the place where error message may be emitted is: llvm-project/clang/lib/Basic/SourceManager.cpp at c28b1a19aadff97b369889aee084073a181cfda8 · llvm/llvm-project · GitHub

where we can see in the interface, there is no AST, while it is possible that we can refactor in the upper layer, it may be hard. And even if we were able to do that, as we said in (1), it is not precisely the same thing.

And BTW, in clang today, this is not only about diagnostics, since we may fetch the file contents without the need to emit diagnostics. This is a defect. And I think embedding the source files into the BMI is most simple and not harmful solution in a foreseeable future. As we can see the text size is not large really (one reason maybe that we have rich AST representation)

AaronBallman · September 18, 2024, 1:10pm

Thank you to everyone who weighed in! Conversation seems to be dying down on this RFC, and I think there’s a consensus formed to:

Have the -fmodules-embed-all-files option
Not enable that option by default
Follow-up work will address usability concerns regarding diagnostic behavior when module source is not available

If anyone has strong opinions to the contrary, please speak up ASAP, otherwise I think @ChuanqiXu can move forward.

Topic		Replies	Views
[RFC] About the default location for std modules Runtimes clang , libcxx	53	3390	April 18, 2024
Generating PCM (module interfaces) and regular object files from the same compiler invocation Clang Frontend	8	871	October 13, 2022
Relocatable modules? Clang Frontend	0	137	August 9, 2018
A question about modules implementation Clang Frontend	1	59	December 12, 2015
Unable to access source_location (by std or experimental) Using Clang llvm	8	1749	January 11, 2023

[RFC] [Modules] Should we embed sources to the BMI?

Related topics