[RFC] [Modules] Should we embed sources to the BMI?

This comes from [C++20] [Modules] Embed all source files for C++20 Modules by ChuanqiXu9 · Pull Request #102444 · llvm/llvm-project · GitHub and Shouldn't `-fmodules-embed-all-files` be the default? · Issue #72383 · llvm/llvm-project · GitHub. Given the history are pretty long, and there are a lot of discussed topics, I’ll try to describe the background here.

Background

Source location is a pretty fundamental concept to clang. During the compilation of the frontend, the compiler needs to have a reference to the contents of the source file some times.

For modules, currently we will embed the source location as offsets instead of embedding the file contents. So when we compile with modules and we need to get the contents of a file, we will try to open the file on disk and get the contents according to the recorded source location.

The behavior has two results:
(1) The BMI became invalid after we change the source file.
(2) When we compile with modules, we always need the dependent source files to be present in the compiling environments.

The first point is minor and it may only affect some debugging experience. But the second point more or less affects the experience of distributed builds and sandboxed builds. If we can get rid of it, we can avoid downloading the source files or putting the files into the sandbox when compiling with modules.

And according to the discussion in the issue’s thread, both GCC and MSVC don’t have such constraints.

To achieve this, there is an issue to require to make -fmodules-embed-all-files to be the default option. Then the source contents will be embedded to the BMI, then we won’t be required to open the referenced file actually during the compilation of modules.

Security concerns on IP

@AaronBallman points out such strategy has security concerns. Since after this, the source contents in the BMI are technically public to the consumers.

My thought to the security question

  1. According to the consensus of SG15/WG21, it is not expected to distribute the BMIs. Since BMI is pretty sensitive to the compiler (with different versions) and configurations. But as pointed by @iains, there are end users who consistently require distributable BMI. (*)
  2. The BMI is built module interface. And the module interfaces may only take the position of headers. Then headers are never private. So I think the option may never make things worse than the era of headers.
  3. Today when we use BMIs, we would always require the source files to be available in the environments. So I think the option won’t make existing usage worse.

(*): It is true that distributable BMI is technically possible. The primary blocking issue is that it is too hard to (design and) implement.

Summary and options forward

(1) Fix the underlying issue. Readers may already recognize that the two topics (whether or not embedding source files) (security concerns) are not technically mutually exclusive. The fundamental technical problem may be that clang require to open the actual file during the compilation. It looks like both GCC and MSVC doesn’t have the problem. But the problem is, it is too fundamental and I actually, don’t have an idea about how to fix it. If we choose this, from the perspective of the actual process, it may be the same thing as not doing any thing in a relative long period.

(2) Make -fmodules-embed-all-files a user option. Currently -fmodules-embed-all-files is a Xclang option. And after we make it a user option. The users who want it can enable it themselves. Then we don’t need to care about the security problems. While this may be the most people prefer, a problem in detail is, should we make it enabled by default? If we don’t make it enabled by default, it is hard for the test to cover. And I think the most people may prefer to embed the source files.

(3) Just embeding the source files. As I mentioned above, I don’t think it will be security a problem since it won’t make any thing worse. And the reason why to not add a new option is, there are already too many options, especially for modules. We talked this many times in modules developers meeting that we don’t like to many options. (The new added -fexperimental-modules-reduced-bmi option is planned to be enabled by default and be removed).

:white_check_mark: Clang: consensus called in this message

3 Likes

@ChuanqiXu thanks for starting this discussion (and apologies in advance - I am likely going to repeat some of what you already said).

To expand a little on the underlying points:

As currently designed, binary C++ modules (BMIs) are transient artefacts of a build (for some definition of ‘build’). A BMI is a (potentially) more efficient encapsulation of a series of sources that define an interface (since significant parts of the compilation have taken place, sharing that between consumers saves resources).

  • It must be assumed that a consumer of a BMI has access to the sources needed to build it; since it is transient, it does not exist at the start of the build (and might not even be built unless/until it it required).

  • There are implementation-related constraints that reinforce this (since for the current implementations on clang and GCC, at least, the content of the BMI is tightly coupled to the compiler’s AST). This means that the build conditions for the consumer need to (closely) match those of the producer.

  • Assuming that these build conditions are met, when consuming a BMI, it is still expected that access to the source would be required to avoid degradation of diagnostics (since those can refer to content used to build said BMI).

So, we expect that toolchains will require access to the source of the BMIs at the point that they are consumed (at least for clang and GCC in my experience).

One mechanism to ensure that the source is available to the consumer is to embed it into the BMI - and making that a default has triggered this discussion.

(For avoidance of doubt - it is not that GCC does not require the sources for good diagnostics - but that currently there has been no decision to embed those sources into the BMI (and there is an available fallback that would produce reduced diagnostic quality, rather than a failure to compile)).

Security Implications

There has been some comment that embedding sources like this might in some way leak intellectual property making it unusable in non-open-source scenarios.

I do not think that this should be the case.

Since the BMIs are ephemeral, and do not exist at the start of a build, the build system / consumers must have access to the sources used to construct them.

Thus (for some distributed library, for example) those sources are analogous to public headers of the library; it is up to library vendors to partition their sources such that private information is not required in the public BMI.

Tooling Implications

I guess the question (at least from my perspective) is more whether the compiler should be responsible for the action of ensuring that source is available to the consumer of a BMI. Arguably, this is the build system’s responsibility - and in a pre-modules world there must already be mechanisms for making headers etc. available.

One use-case cited is large-scale distributed builds, where the flags are consistent such that BMIs can be consumed on remote nodes without recompilation - but still the source is required for diagnostic quality; embedding the source means that only one artefact needs to be copied to the remote node in this case.

  • The general principle of C++ is that one should not pay for something that one does not use. So it can be argued that all the non-distributed builds should not be paying a size penalty (albeit not a large one) for something they do not require.

  • In any event (IMO) it would be a good idea to agree on the mechanism if it is to be used [between as many toolchains as feasible], and on the compiler flags etc. so that tooling does not have to change actions significantly between toolchains.

2 Likes

I summarize your reply as, you don’t think it is a security issue, but also not support to embed the sources to the BMI due to:
(1) It may enlarge the size of the BMI.
(2) It may make more divergence in build systems.

For the first, the data provided by @boris-kolpackov in Shouldn't `-fmodules-embed-all-files` be the default? · Issue #72383 · llvm/llvm-project · GitHub is, the size of the BMI with embedded source is 31662KB with and 30574KB without the embedded source.

For the second point, I don’t feel it is strong since clang won’t ask the build systems to change the behavior of this change. The build systems can still do what they do now. They will only have a chance to optimize the build process for clang. The optimization can be done for other compilers too. And it should be pretty faster to not move the source files between sandboxes and different machines.

Does “require” mean “the compiler rejects the code trying to import a BMI when the source is not available?”

If the answer is “no”, then my concerns about distributing the source remain. We have plenty of historical experience with people shipping PCHs (also not considered a best practice); not recommended is not the same as not possible. I think we should be secure by default, and the more secure default is to not ship the original source code as part of the BMI. Because this is a distribution model we explicitly recommend against, getting a degraded user experience with poor diagnostic output (etc) is reasonable IMO. (Also, we could issue a diagnostic telling the user the reason for the degraded experience, which helps educate users on the better distribution model.)

However, if the compiler does reject BMIs without access to the original source, then I no longer have concerns about shipping the source within the BMI.

I wish C++ source would not be considered an IP that needs to be protected behind a binary wall of sorts. Like, you know, many other languages do.

FWIW, I’m not opposed to the option existing, only speaking about the defaults. There’s historical inertia that’s hard to overcome – I don’t think anyone expects to be able to get back to full, original source from a binary artifact in either C or C++ (decompilers can get you plenty close, but this is original source including comments), so I think it’s an unsafe default because 1) it exposes private, sometimes sensitive information, 2) it’s the opposite of what entrenched users will expect which makes it a really easy thing to overlook.

I am not confident that it is the correct default.

Also I do not recall any design discussion of alternate strategies - perhaps there are none (but we should confirm that).

I started a thread on GCC’s development mailing list (which also points to this discussion).

https://inbox.sourceware.org/gcc/66d70824.630a0220.26e6d4.e74d@mx.google.com/T/#t

It is more that, in any realistic build strategy, with current C++ modules, the sources are required to build the BMI once - even if it is consumed many times. This is because it is currently not possible (or expected) to distribute a BMI along with some library [for all the same reasons as PCH at minimum].

That is the origin of the ‘requirement’ - and a reason to be state that all we are talking about here is whether the compiler should be responsible for distributing the sources with the BMI, or whether it should be build-system territory.

My point is that it may not be expected to do so, but it certainly is possible and users will do it in practice because it solves a problem for them. The specific scenario I’m aware of with PCH (and I believe the same will be true for modules) is when a company produces a library to be linked statically by an application produced by a subcontractor. In this scenario, the company is dictating the development environment used by the subcontractor, which is the only reason this “works”. FWIW, I don’t think we should encourage this sort of deployment practice.

I think it should be the build system’s responsibility. For one, that solves my concerns about accidentally exposing sensitive IP. For two, that makes the BMIs (ever-so-slightly) smaller. For three, it makes the dependency between the BMI and the original source more explicit – this is beneficial for third-party tooling such as static analyzers (they need to get back to the original source code and often don’t want to try to recognize every compiler’s BMI format).

My answer is: No.

There are a handful of negative problems with including the source, the most important of which to me, is the cost of opening/analyzing/using/etc the BMI file. The larger it is, the more problematic that is going to be on filesystems, load times, etc.

So that leaves me with two variants of the proposal:

1- Require the build system/whatever ensure that the source is available in the same version as the BMI file, and error-ing if not. IMO, this permits us to delay the parsing of the source file until absolutely necessary (that is, not touch the source file until we need it to emit an error, though I know we don’t do that today).

2- Same as above, but come up with a good way of recovering. Since the purpose of needing the source is diagnostics, a good ‘fallback’ mechanism of “we don’t have the source file, so we can’t point at anything here” is a good way forward. It would be GREAT if we stored Line/Column info as well as offset, so even if we’re not pointing at the line, we can refer to it.

IMO, Option #2 is the best. If the person wants good diagnostics, they have to provide the file, else they get less good diagnostics and life goes on. This supports the distribution concerns Aaron has (he’s right, folks are going to ship BMI whether we like it or not, unless we actively prevent it) while not introducing any IP issues (NOTE: Distribution of source isn’t really a SECURITY concern as stated by Aaron, its an IP concern).

1 Like

It makes sense to me that embedding the source file exist as an option. Distributed builds are indeed a motivational use case for that.
However, i think it’s not a very good default.

As @erichkeane’s it is wasteful, and ever if I wholeheartedly agree that that distributing BMI is a not a great idea and should not be encouraged… people are still going to do it. And as other said there are security concerns there.
From there, it is reasonable to assume some people will need a way to make sure the source is not embedded.
Which means that whether the source is available or not the quality of diagnostics, tooling etc should be the same. or at least serviceable enough. Which means we can’t do anything useful with the source.

So at the end of the day, I think the following observations

  • In most cases the source will be available on disk so we don’t need to embed it
  • Clang needs to work well enough regardless of whether the source is present or not
  • Having a BMI without source is only viable in a very narrow set of circumstances (distributed build)

Point to needing an option to embed sources files that is not enabled by default.

And when the user decide to embed the source, should they have the possibility to strip comments? (I am not sure about that at all, but I think it’s worth asking ourselves the question)

I think we’re conflating a few different things here. I’d like to talk about just one of them: the security perspective.

I think the most important thing to be clear on is: do we guarantee to remove any information when producing a PCM file? If not, that is, if the full, original source might be recoverable, and we would not consider it to be a bug if we made more information recoverable, then we should simply document that the PCM is potentially equivalent to the original source and should never be relied upon to remove information. Pretending that we hide information when we don’t would not do anyone any favors.

Given the nature of the information in the PCM file, I think the only information we could even conceivably guarantee that we remove is code comments. So, do we want to provide a new guarantee, that a PCM file will never contain comments (unless you turn on an option like -fmodules-embed-all-files)? I think that the answer is no: we already preserve documentation comments in at least some modes, and may want to do more of that in future. (For example, in an IDE it can be useful to show the comments associated with an entity, and so it would be reasonable to track those comments in PCM files.) As it happens, we also do embed source files in some cases even with -fmodules-embed-all-files disabled – IIRC files that are not regular files (eg #embed "/proc/self/cmdline") or that have VFS content overrides (generally, files we can’t reasonably expect to be available with the same contents later) always get embedded.

Because of the above, I think the only responsible and forward-looking thing we can do from a security perspective is to clearly and publicly state that PCM files are not and should not be treated as an information hiding mechanism – they should always be assumed to contain all the information that was used to create them, in a recoverable form.

So let’s document that. And then let’s separately decide whether we want -fmodules-embed-all-files on by default, but without any security / IP concerns.

3 Likes

That (@zygoloid 's post) seems really reasonable. I think there are plenty of reasons to NOT include source, but I think I’m convinced now that the security/IP horse is already out of the barn on that one, so I think documenting that the BMI necessarily contains source-like-constructs should be the appropriate way forward.

And frankly, I don’t think ANY of us want the contents of BMIs to not have that restriction messaging in the future: I’d hate for us to have to limit the contents of our compilation of the BMI (for example levels of LTO) based on some misguided guarantee of IP protection.

Thank you for this perspective!

Agreed.

For the record, there’s no appetite for embedding the sources in BMIs coming from the GCC thread.

It seems to me that @erichkeane has put forward the best solution here in #2 of the post earlier - i.e. make it possible for clang to consume a BMI without the source available with the only ‘expense’ being degraded diagnostics. That is, essentially, the status quo for GCC as I understand it.

1 Like

Thanks for the comments. I think the consensus is:

(1) Provide an option for it. But it won’t be the default.
(2) Document that: “PCM files are not and should not be treated as an information hiding mechanism – they should always be assumed to contain all the information that was used to create them, in a recoverable form”.


Some other clarifications:

(1) Currently, in clang, the contents of the file may not only be required when a diagnostic is going to be emitted.

That may be the status of GCC now. Or the ideal status for clang. But now, clang may open the file on disk during the compilation without emitting diagnostics. One example is in the github issue. And as I stated in the post, this is not ideal and can be improved. But I don’t have an insight for it and don’t know how to fix it.

(2) The comments are needed.

They are pretty helpful in LSP tools like clangd.

(3) Does “require” mean “the compiler rejects the code trying to import a BMI when the source is not available?”

Not always. I mean the logically “require”. Now when we load a module, we won’t check if all its input files are present. As we always tried, (to make things lazy in modules), we would only try to open the source files when we want to. So it is completely possible that the compiler accept the code without available source files on disk without embedding the source files.

(4) Embedding the source files are not only beneficial for distributed builds but also sandboxed builds.

2 Likes

Sent [C++20] [Modules] Offer -fmodules-embed-all-files option by ChuanqiXu9 · Pull Request #107194 · llvm/llvm-project · GitHub

So to respond to the consensus call, I don’t think that properly carrys my thoughts on it, but I’m not competely opposed to it.

For #1: I am against actually having an option for it. I think that if GCC isn’t going to do it, build systems are going to have to move the code anyway, so this ends up being a rarely/never used functionality that we have to test and maintain.

But now, clang may open the file on disk during the compilation without emitting diagnostics. One example is in the github issue.

Which github issue do you mean? That said, I would probably wish that we spend time fixing THAT requirement, or finding a valid fallback in those situations for when the file doesn’t exist.

So it is completely possible that the compiler accept the code without available source files on disk without embedding the source files

This seems actively harmful FWIW. “requiring” the source files to be present(either by flag or existence) without a good mental model /explanation for devs seems problematic.

(4) Embedding the source files are not only beneficial for distributed builds but also sandboxed builds.

I guess I only see this being the case for custom build systems, because if GCC or MSVC doesn’t do this too, no one is going ot use this feature.

Maybe we can coordinate with GCC here? Distributed builds, which I consider important enough, would like it, and if we’re willing to have such flag, that might tip the scale for GCC.