Modules TS: binary module interface dependencies

I am trying to understand how Clang's Modules TS will work so that we
have a general-enough model in the build system.

Consider these two modules and a consumer:

// module core
export module core;
export void f (int);

// module extra
export module extra;
import core;
export inline void g (int x) {f (x);}

// consumer
import extra;
int main () {g (0);}

Currently, when compiling the consumer (with -fmodules-ts), Clang only
requires the binary module interface (BMI) for extra. In contrast, VC
and GCC require both extra and core (note: even though core is not
re-exported from extra).

Clang's model is definitely more desirable from the build system's
perspective, especially for distributed compilation. So I wonder if this
is accidental and will change in the future or if this is something that
Clang is committed to, so to speak?

Here is a more interesting variant of the extra module that highlights
some of the issues to consider:

// module extra
export module extra;
import core;
export template <typename T> void g (T x) {f (x);}

Now f() can only be resolved (via ADL) when g() is instantiated.

Thanks,
Boris

> Currently, when compiling the consumer (with -fmodules-ts), Clang only
> requires the binary module interface (BMI) for extra. In contrast, VC
> and GCC require both extra and core (note: even though core is not
> re-exported from extra).
>
> Clang’s model is definitely more desirable from the build system’s
> perspective, especially for distributed compilation. So I wonder if this
> is accidental and will change in the future or if this is something that
> Clang is committed to, so to speak?

I believe this functionality is intentional. After deploying pre-TS modules at Google, the number of files that had to be passed without this feature turned out to be problematic (maybe hitting issues with command line length? I’m not sure what particular constraint it was running up against).

CC’d Richard to correct/clarify/etc

> Currently, when compiling the consumer (with -fmodules-ts), Clang only
> requires the binary module interface (BMI) for extra. In contrast, VC
> and GCC require both extra and core (note: even though core is not
> re-exported from extra).

Assuming by BMI you mean the .pcm file, this is not true. Clang requires
both the core and extra .pcm files in this case. However, we found it
extremely impractical to explicitly pass all such .pcm files to a
compilation (and indeed on large projects doing so caused us to hit command
line length limits and generally produce *highly* unwieldy command lines),
so we do not require .pcm's that are reachable through the dependencies of
another .pcm file to be explicitly passed to the compiler -- each .pcm
stores names and relative paths to its dependencies, and we load those
dependencies as part of loading the .pcm itself.

You can find some further attempts to understand the impact of modules on
build systems here:

Redirecting to Google Groups

Maybe it's useful to you.

Thanks,

Steve.

Richard Smith <richard@metafoo.co.uk> writes:

> Assuming by BMI you mean the .pcm file, this is not true. Clang requires
> both the core and extra .pcm files in this case. However, we found it
> extremely impractical to explicitly pass all such .pcm files to a
> compilation (and indeed on large projects doing so caused us to hit command
> line length limits and generally produce *highly* unwieldy command lines),
> so we do not require .pcm's that are reachable through the dependencies of
> another .pcm file to be explicitly passed to the compiler -- each .pcm
> stores names and relative paths to its dependencies, and we load those
> dependencies as part of loading the .pcm itself.

Got (and tested) it, thanks. I suppose there is no reason for you to
deviate from this once you support module re-export (export import M;)
even though, in a sense, re-export is as-if injecting an implicit import
into the consumer's translation unit?

One thing I noticed is that there is no way to override this embedded
path, at least not with -fmodule-file. This could be useful for
distributed compilation since otherwise the build system will have
to recreate the directory structure on the remote host.

Would there be interest in having a low-level option that specifies
the exact module name to module .pcm mapping and, perhaps, a second
one that can read such mappings from a file? They will then override
module file references in .pcm's.

Thanks,
Boris

> Got (and tested) it, thanks. I suppose there is no reason for you to
> deviate from this once you support module re-export (export import M;)
> even though, in a sense, re-export is as-if injecting an implicit import
> into the consumer's translation unit?

Right. From the point of view of a user of the re-exporting module, they
don't depend on M, so they should not need to specify a .pcm file for M.

> One thing I noticed is that there is no way to override this embedded
> path, at least not with -fmodule-file. This could be useful for
> distributed compilation since otherwise the build system will have
> to recreate the directory structure on the remote host.

So far, we've not seen this be a problem in practice across the (small)
number of build systems where we've implemented support for explicit module
builds. If this is a problem for your build system, we can certainly look
at adding support for overriding this.

> Would there be interest in having a low-level option that specifies
> the exact module name to module .pcm mapping and, perhaps, a second
> one that can read such mappings from a file? They will then override
> module file references in .pcm's.

I don't think we need a mapping mechanism; giving us the module files on
the command line in topological order should suffice. If we've already been
handed a module file for module X, and then we load a module file for
module Y that depends on X, we can simply ignore the path specified in Y's
.pcm and just use the existing X .pcm. (We'd still perform the check that
the X .pcm is the same as the one that Y was built against in this case.)
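The override rule described above can be sketched as follows. This is only an illustration of the idea, not Clang's implementation: the types EmbeddedRef, Loaded, and LoadedModules, the function resolve_dependency, and the notion of a string "signature" identifying a particular .pcm are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for what a .pcm records about one of its imports.
struct EmbeddedRef {
  std::string module;     // name of the imported module
  std::string path;       // path stored in the importing .pcm
  std::string signature;  // identifies the exact .pcm it was built against
};

// A module file that was already handed to the compiler on the command line.
struct Loaded {
  std::string path;
  std::string signature;
};
using LoadedModules = std::map<std::string, Loaded>;

// Pick the file to use for a dependency: prefer a module file that is
// already loaded and ignore the embedded path, but still insist that it is
// the same file the dependent module was built against.
std::string resolve_dependency(const LoadedModules& loaded,
                               const EmbeddedRef& ref) {
  assert(!ref.module.empty());
  auto it = loaded.find(ref.module);
  if (it == loaded.end())
    return ref.path;  // nothing preloaded: fall back to the embedded path
  if (it->second.signature != ref.signature)
    throw std::runtime_error("module file for '" + ref.module +
                             "' differs from the one it was built against");
  return it->second.path;
}
```

For this to work, the module file for X has to appear before the module file for Y on the command line, which is where the topological ordering requirement comes from.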

We could actually build the topological ordering ourselves, but that would
require a two-pass approach for loading .pcm files; passing this burden on
to the build system seems like the better tradeoff.

Richard Smith <richard@metafoo.co.uk> writes:

> I don't think we need a mapping mechanism; giving us the module files on
> the command line in topological order should suffice. If we've already been
> handed a module file for module X, and then we load a module file for
> module Y that depends on X, we can simply ignore the path specified in Y's
> .pcm and just use the existing X .pcm. (We'd still perform the check that
> the X .pcm is the same as the one that Y was built against in this case.)

I've done some testing and this is not how it works today. Perhaps you
meant it in the "could be done this way" sense.

BTW, I've also tested moving the entire build directory somewhere else
to check if .pcm's store relative paths to each other. This does not
appear to work either:

fatal error: malformed or corrupted AST file: 'SourceLocation remap refers to unknown module, cannot find core.pcm'

On the more fundamental level, this still poses a problem if the build
system needs to re-map all the .pcm files (e.g., for a distributed build):
we will still have crazy-long command lines and may hit the command line
limits. So, at a minimum, we seem to need a way to load the list of modules
from a file.

Now, for why we may want a mapping, not just a list of .pcm's: if the list
of .pcm's is stored in a file, then chances are some build systems will
opt to have one file per project (or some similar granularity) rather
than per translation unit. Which means not all listed .pcm's will be
needed during every compilation. If it's only a list of .pcm's, then
Clang will have to at least read each file, which seems like a waste.

> We could actually build the topological ordering ourselves, but that would
> require a two-pass approach for loading .pcm files; passing this burden on
> to the build system seems like the better tradeoff.

I agree. Though requiring a sorted list of modules doesn't make the build
system's life any easier, especially if it wants to weed out duplicates
(to keep the command line as tidy as possible) and not allocate any
extra memory while doing it.
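To make the cost concrete, here is a minimal sketch of what a build system would have to do to produce a de-duplicated, dependencies-first list of modules. The names ImportGraph, visit_module, and topological_order are made up for this example, and it assumes the import graph is already known and acyclic. Note that it does allocate a set to track visited modules, which is exactly the extra memory mentioned above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Direct imports of each module, as the build system already knows them.
using ImportGraph = std::map<std::string, std::vector<std::string>>;

// Depth-first, post-order walk: a module is emitted only after its imports.
static void visit_module(const ImportGraph& graph, const std::string& module,
                         std::set<std::string>& seen,
                         std::vector<std::string>& order) {
  if (!seen.insert(module).second)
    return;  // already visited: this is the de-duplication step
  auto it = graph.find(module);
  if (it != graph.end())
    for (const std::string& dep : it->second)
      visit_module(graph, dep, seen, order);
  order.push_back(module);
}

// Return every module reachable from the TU's direct imports, each exactly
// once, with dependencies before dependents.
std::vector<std::string> topological_order(
    const ImportGraph& graph, const std::vector<std::string>& direct_imports) {
  std::set<std::string> seen;
  std::vector<std::string> order;
  for (const std::string& module : direct_imports)
    visit_module(graph, module, seen, order);
  assert(order.size() == seen.size());  // each visited module emitted once
  return order;
}
```

For the example from the start of the thread, a consumer that imports extra would get core first and extra second.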

Boris

Yes, this is a “we can” rather than a “we already do”. Sorry that wasn’t clear.

We’ve talked about making this kind of relocation easier by allowing the module source directory and the build directory to be relocated independently (right now you need to relocate everything together – sources, .pcm’s, working directory).

Clang does support specifying @file on the command line to take arguments from a file, which should at least evade the command line length limit.

True. At that point I think you’d be better off with a directory of .pcm files following a naming convention rather than providing the compiler with a (potentially very large) set of mappings (and we already support something like that). But allowing an explicit mapping to be specified would also be fine if people would actually use that facility.

Granted. Our design right now is pretty strongly tied to having loaded all dependency modules before loading a dependent module, though, so we need that complexity somewhere.

Richard Smith <richard@metafoo.co.uk> writes:

> We've talked about making this kind of relocation easier by allowing the
> module source directory and the build directory to be relocated
> independently (right now you need to relocate everything together --
> sources, .pcm's, working directory).

That's what I did and got the above-mentioned error.

> At that point I think you'd be better off with a directory of .pcm files
> following a naming convention rather than providing the compiler with
> a (potentially very large) set of mappings (and we already support
> something like that).

Here is a concrete scenario I am thinking about: I want to implement
distributed compilation that supports modules. Which means that
besides the translation unit itself, the build system also needs
to ship .pcm's of all the modules that this TU imports (transitively).

In itself, this is not a problem: the build system needs to make sure
that these .pcm's are all up-to-date before it can invoke the compiler.
So it has to know the paths to all the .pcm's which, in case of build2,
are spread out across various project directories (since we try to
reuse already compiled .pcm's from projects that we import).

For distributed compilation we want to minimize the amount of stuff we
copy back and forth so it makes sense to cache .pcm's on the build
slaves (the same .pcm is likely to be used by multiple TUs). So on
the build slave I would store a list of .pcm files, their hashes,
and their module names. Since the same module can be compiled with
different options and result in a different .pcm/hash, I would use
the hash as the file name to store .pcm's on the slave (i.e., content-
addressable storage).

With this pretty straightforward setup, when the time comes to compile
a TU, all I need is to somehow communicate to the compiler the
mapping of module names to these hash-named .pcm's. If there were
a way to provide this mapping in a file, I would be all set.
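As a rough sketch of this setup: the code below derives a cache file name from the contents of a .pcm and renders a name-to-file mapping as text. Everything here is an assumption made for illustration. The 64-bit FNV-1a hash is only a dependency-free stand-in (a real cache would more likely use a cryptographic hash), and both the modules/<hash>.pcm layout and the one "<name>=<file>" entry per line format are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Stand-in content hash (64-bit FNV-1a), used only to keep the sketch
// self-contained.
std::uint64_t fnv1a64(const std::string& bytes) {
  std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
  for (unsigned char c : bytes) {
    h ^= c;
    h *= 0x100000001b3ULL;  // FNV prime
  }
  return h;
}

// Hypothetical cache layout: modules/<16 hex digits>.pcm
std::string cache_path(const std::string& pcm_bytes) {
  char hex[17];
  int n = std::snprintf(hex, sizeof hex, "%016llx",
                        static_cast<unsigned long long>(fnv1a64(pcm_bytes)));
  assert(n == 16);
  (void)n;
  return std::string("modules/") + hex + ".pcm";
}

// Render one "<name>=<file>" line per module (the format is an assumption).
std::string render_mapping(
    const std::map<std::string, std::string>& name_to_file) {
  std::string out;
  for (const auto& entry : name_to_file)
    out += entry.first + "=" + entry.second + "\n";
  return out;
}
```

Two builds of the same module with different options produce different .pcm contents and therefore different cache paths, while the module name in the mapping stays the same.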

With the directory approach, I would need to create a temporary
directory and populate it with appropriately-named symlinks (or
copies in case of Windows) of .pcm files. While not particularly
hard, it sure feels unnecessary. I would definitely try to avoid
doing this for local compilations which means I will have two
different ways of invoking the compiler depending on whether it
is remote or local. And it is still not clear to me how this will
override embedded .pcm references.

> But allowing an explicit mapping to be specified would also be fine
> if people would actually use that facility.

I will use it in build2. And I am willing to try to implement it.

> Our design right now is pretty strongly tied to having loaded all
> dependency modules before loading a dependent module, though, so
> we need that complexity somewhere.

I don't think we will need it with the mapping approach: we will have
a map of module names to file names, probably in HeaderSearchOptions
next to PrebuiltModulePaths -- in a sense it will be another module
search mechanism that will be tried before prebuilt paths (in
HeaderSearch::getModuleFileName()).

This map will be populated before we actually load any modules so
the order in which one specifies the mapping is not important
(except for overriding). I will probably need to add some extra
code to consult this map when resolving embedded .pcm references,
though.

And we could also keep updating this map when loading modules via
other means (e.g., with -fmodule-file) which will give us the
override behavior we discussed earlier (I won't need this
functionality in build2 but could implement it if others think
it would be useful).
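The lookup order proposed here can be modelled roughly like this. It is a standalone sketch and does not use Clang's actual HeaderSearchOptions or HeaderSearch types: ModuleLocator and its members are hypothetical, the <dir>/<name>.pcm naming for prebuilt paths is an assumption, and the file-existence check is injected so the logic can be exercised without touching the file system.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct ModuleLocator {
  std::map<std::string, std::string> mapping;       // module name -> .pcm file
  std::vector<std::string> prebuilt_paths;          // fallback directories
  std::function<bool(const std::string&)> exists;   // injected existence check

  // Entries can be added in any order; a later entry for the same name
  // overrides an earlier one.
  void add_mapping(const std::string& name, const std::string& file) {
    mapping[name] = file;
  }

  // Consult the explicit mapping first, then the prebuilt directories.
  // Returns an empty string if the module cannot be located.
  std::string locate(const std::string& name) const {
    assert(!name.empty());
    auto it = mapping.find(name);
    if (it != mapping.end())
      return it->second;  // the file is not opened or stat'ed here
    for (const std::string& dir : prebuilt_paths) {
      std::string candidate = dir + "/" + name + ".pcm";
      if (exists(candidate))
        return candidate;
    }
    return std::string();
  }
};
```

Because the map is fully populated before any module is loaded, nothing in locate() depends on the order in which mappings were given, apart from the override rule.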

If this sounds reasonable, I can give it a go.

Thanks,
Boris

> Richard Smith <richard@metafoo.co.uk> writes:
>
> > We've talked about making this kind of relocation easier by allowing the
> > module source directory and the build directory to be relocated
> > independently (right now you need to relocate everything together --
> > sources, .pcm's, working directory).
>
> That's what I did and got the above-mentioned error.

Hmm, that could well be a bug, then. Do you by any chance have steps to
reproduce this?

> > At that point I think you'd be better off with a directory of .pcm files
> > following a naming convention rather than providing the compiler with
> > a (potentially very large) set of mappings (and we already support
> > something like that).
>
> Here is a concrete scenario I am thinking about: I want to implement
> distributed compilation that supports modules. Which means that
> besides the translation unit itself, the build system also needs
> to ship .pcm's of all the modules that this TU imports (transitively).
>
> In itself, this is not a problem: the build system needs to make sure
> that these .pcm's are all up-to-date before it can invoke the compiler.
> So it has to know the paths to all the .pcm's which, in case of build2,
> are spread out across various project directories (since we try to
> reuse already compiled .pcm's from projects that we import).
>
> For distributed compilation we want to minimize the amount of stuff we
> copy back and forth so it makes sense to cache .pcm's on the build
> slaves (the same .pcm is likely to be used by multiple TUs). So on
> the build slave I would store a list of .pcm files, their hashes,
> and their module names. Since the same module can be compiled with
> different options and result in a different .pcm/hash, I would use
> the hash as the file name to store .pcm's on the slave (i.e., content-
> addressable storage).
>
> With this pretty straightforward setup, when the time comes to compile
> a TU, all I need is to somehow communicate to the compiler the
> mapping of module names to these hash-named .pcm's. If there were
> a way to provide this mapping in a file, I would be all set.

For what it's worth, this setup with named symlinks (whose names are stable
across all builds) is how our (Google's) internal build system handles this.

> With the directory approach, I would need to create a temporary
> directory and populate it with appropriately-named symlinks (or
> copies in case of Windows) of .pcm files. While not particularly
> hard, it sure feels unnecessary. I would definitely try to avoid
> doing this for local compilations which means I will have two
> different ways of invoking the compiler depending on whether it
> is remote or local.

Because you don't use the content-addressed system locally? What we do is
to use symlinks for remote compilations and just put the files in the
"right" places locally, so the file system looks the same either way.

> And it is still not clear to me how this will
> override embedded .pcm references.

I don't think it would, but if the paths to dependencies are always the
same, you shouldn't need to override any of those references.

> > But allowing an explicit mapping to be specified would also be fine
> > if people would actually use that facility.
>
> I will use it in build2. And I am willing to try to implement it.

OK :-)

> > Our design right now is pretty strongly tied to having loaded all
> > dependency modules before loading a dependent module, though, so
> > we need that complexity somewhere.
>
> I don't think we will need it with the mapping approach: we will have
> a map of module names to file names, probably in HeaderSearchOptions
> next to PrebuiltModulePaths -- in a sense it will be another module
> search mechanism that will be tried before prebuilt paths (in
> HeaderSearch::getModuleFileName()).
>
> This map will be populated before we actually load any modules so
> the order in which one specifies the mapping is not important
> (except for overriding). I will probably need to add some extra
> code to consult this map when resolving embedded .pcm references,
> though.
>
> And we could also keep updating this map when loading modules via
> other means (e.g., with -fmodule-file) which will give us the
> override behavior we discussed earlier (I won't need this
> functionality in build2 but could implement it if others think
> it would be useful).
>
> If this sounds reasonable, I can give it a go.

Sure. I think my only remaining concerns are:

1) this is likely to end up with a set of command line arguments that grows
linearly with the total number of modules in the project, and you're likely
to find the build system needs or wants to prune the list down to just the
dependencies anyway
2) we can't do any validation that the command line arguments are
reasonable if the corresponding module is not used (we don't want to stat a
large number of .pcm files if most of them are not going to be used, and
definitely don't want to read the file header to find if it names the right
module)

I don't think (2) is really a big deal, though, since we'll get at least a
"file not found" error if the module is actually used by the compilation.
And (1) is ultimately your problem as the build system maintainer, not
ours. ;-)

Richard Smith <richard@metafoo.co.uk> writes:

> Do you by any chance have steps to reproduce this?

This is with 5.0.0-svn305177-1~exp1 (trunk):

mkdir /tmp/test
cd /tmp/test

cat >core.mxx <<EOF
export module core;
export void f ();
EOF

cat >extra.mxx <<EOF
export module extra;
import core;
EOF

cat >driver.cxx <<EOF
import extra;
int main () {}
EOF

clang++-5.0 -std=c++1z -fmodules-ts -o core.pcm --precompile -Xclang -fmodules-embed-all-files -Xclang -fmodules-codegen -Xclang -fmodules-debuginfo -x c++-module core.mxx
clang++-5.0 -std=c++1z -fmodules-ts -fmodule-file=core.pcm -o extra.pcm --precompile -Xclang -fmodules-embed-all-files -Xclang -fmodules-codegen -Xclang -fmodules-debuginfo -x c++-module extra.mxx
clang++-5.0 -std=c++1z -fmodules-ts -fmodule-file=extra.pcm -o driver.o -c driver.cxx

cd ..
mv test ~/
cd ~/test

clang++-5.0 -std=c++1z -fmodules-ts -fmodule-file=extra.pcm -o driver.o -c driver.cxx
fatal error: module file '/tmp/test/core.pcm' not found: module file not found
note: imported by module 'extra' in 'extra.pcm'

Richard Smith <richard@metafoo.co.uk> writes:

> Because you don't use the content-addressed system locally? What we do is
> to use symlinks for remote compilations and just put the files in the
> "right" places locally, so the file system looks the same either way.

Locally we use things as arranged by the user. For example, if project
bar uses libfoo that contains libfoo/foo.pcm, then we will (try to)
use this libfoo/foo.pcm where the user built it.

On the build slave, however, we may be building things for multiple
projects simultaneously and each may have its own foo.pcm. So here
we will call it something like modules/93255[...]5f6db7.pcm.

If I am able to specify the module-name to module-file mapping, I
will be doing essentially the same thing locally and remotely:

-fmodule-blah=foo=libfoo/foo.pcm

-fmodule-blah=foo=modules/93255[...]5f6db7.pcm

> 1) this is likely to end up with a set of command line arguments that grows
> linearly with the total number of modules in the project, and you're likely
> to find the build system needs or wants to prune the list down to just the
> dependencies anyway
> 2) we can't do any validation that the command line arguments are
> reasonable if the corresponding module is not used (we don't want to stat a
> large number of .pcm files if most of them are not going to be used, and
> definitely don't want to read the file header to find if it names the right
> module)
>
> I don't think (2) is really a big deal, though, since we'll get at least a
> "file not found" error if the module is actually used by the compilation.

Agree. It will either be detected at some point or it will be harmless.

> And (1) is ultimately your problem as the build system maintainer, not
> ours. ;-)

Agree. Also, my plan is to have two options: one to specify the mapping
on the command line (one entry at a time) and the other to read it from
a file. So the file option will help build systems that, for example,
want to specify a single (and potentially large) mapping file per project
or some such.

Which brings me to the most difficult part: choosing option names that
everyone likes ;-). And, BTW, I am hoping to implement the same in GCC
with the same names.

So we are looking for two options, one to specify a mapping entry and
the other to specify a mapping file with multiple entries:

-fmodule-blah=<name>=<file> | -fmodule-blah-blah=<file>

Here is what I came up with:

(1) -fmodule= | -fmodule-map=
(2) -fmodule-map= | -fmodule-map-file=
(3) -fmodule-loc= | -fmodule-loc-file=
(4) -fmodmap= | -fmodmap-file=

1. While nice and short, the use of -fmodule might be too close to
   -fmodules. On the other hand, these options will normally be used
   by build systems (the user will just use -fmodule-file) so probably
   not a major issue.

2. These are nice except -fmodule-map-file is already used. One way
   to resolve this would be to "overload" -fmodule-map-file to mean
   something different in the -fmodules-ts mode. Though I suspect its
   current meaning could be useful even in -fmodules-ts.

3. This is an attempt at using something other than 'map'. It has a
   nice property of suggesting that specifying these options doesn't
   actually cause the modules to be loaded.

4. Another play on the 'map' theme. I think it will be hard to sell
   to the GCC folks since they don't have the -fmodule-map-file issue.

Any preferences/suggestions? My favorite is (1).

Thanks,
Boris

-fmodule= is a little too nonspecific for my tastes; I'd expect this to do
what clang's -fmodule-name= does (that is, specify the name of the current
module) before I'd expect it to specify an external module file's path.

How about something like -fmodule-file-<name>=path?

Richard Smith <richard@metafoo.co.uk> writes:

> -fmodule-blah=<name>=<file> | -fmodule-blah-blah=<file>
>
> Here is what I came up with:
>
> (1) -fmodule= | -fmodule-map=
> (2) -fmodule-map= | -fmodule-map-file=
> (3) -fmodule-loc= | -fmodule-loc-file=
> (4) -fmodmap= | -fmodmap-file=
>
> 1. While nice and short, the use of -fmodule might be too close to
> -fmodules. On the other hand, these options will normally be used
> by build systems (the user will just use -fmodule-file) so probably
> not a major issue.
>
> 2. These are nice except -fmodule-map-file is already used. One way
> to resolve this would be to "overload" -fmodule-map-file to mean
> something different in the -fmodules-ts mode. Though I suspect its
> current meaning could be useful even in -fmodules-ts.
>
> 3. This is an attempt at using something other than 'map'. It has a
> nice property of suggesting that specifying these options doesn't
> actually cause the modules to be loaded.
>
> 4. Another play on the 'map' theme. I think it will be hard to sell
> to the GCC folks since they don't have the -fmodule-map-file issue.
>

> -fmodule= is a little too nonspecific for my tastes; I'd expect this to do
> what clang's -fmodule-name= does (that is, specify the name of the current
> module) before I'd expect it to specify an external module file's path.

On the other hand, -fmodule=<name>=<file> describes the module completely
(name and .pcm) while -fmodule-name and -fmodule-file are sub-components
(though in slightly different contexts). But I agree, -fmodule is probably
too terse.

> How about something like -fmodule-file-<name>=path?

Is this really -fmodule-file-<name> (as in -fmodule-file-foo.core=core.pcm)
or was it supposed to be '=' (as in -fmodule-file=[<name>=]<file>)?

I think the former is too unconventional and will be hard to support
in most option parsers (I know for sure GCC will be a pain).

I like the latter, that is, "extend" -fmodule-file with optional module
name. The semantics, as I understand it, will be a bit different though:
-fmodule-file=<file> will cause the module to be loaded while
-fmodule-file=<name>=<file> only makes the location of the module known.
But I don't think the difference will be observable by the end user (i.e.,
loading a module that is not imported does not change anything)?

If we go with -fmodule-file=[<name>=]<file> then the second option will
naturally be -fmodule-file-map=<file>. I like it.
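For illustration, parsing the value of the proposed -fmodule-file=[<name>=]<file> form could look something like the sketch below. ModuleFileArg and parse_module_file_value are hypothetical names, not anything that exists in Clang or GCC, and the sketch assumes that the first '=' in the value separates the module name from the file, so a bare file path must not itself contain '='.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for one -fmodule-file value.
struct ModuleFileArg {
  std::string name;  // empty for the plain -fmodule-file=<file> form
  std::string file;  // path to the .pcm
};

// Parse the text that follows "-fmodule-file=". With a name, the option
// would only record where the module lives; without one, it would keep its
// current meaning of loading the file.
ModuleFileArg parse_module_file_value(const std::string& value) {
  assert(!value.empty());
  const std::string::size_type eq = value.find('=');
  ModuleFileArg arg;
  if (eq == std::string::npos) {
    arg.file = value;
  } else {
    arg.name = value.substr(0, eq);
    arg.file = value.substr(eq + 1);
  }
  return arg;
}
```

With this split, -fmodule-file=foo.core=core.pcm yields the name foo.core and the file core.pcm, while -fmodule-file=core.pcm yields an empty name.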

Boris