[RFC] Storing relative paths in .pcm files

Hey Richard (& cfe-dev),

Currently if one AST file imports another (e.g. module A imports module B), we store the absolute path of module B inside module A’s IMPORTS record. When we know that both files will always be in the same directory, this wastes space and more importantly prevents moving those modules to another directory. The latter is very handy when debugging a module bug for which someone has given you their broken module cache.

When an implicitly built module imports another implicitly built module, we can rely on the modules always being in the same module cache, and I think we should switch to a relative path that is either looked up relative to the current pcm file or the (hash-specific) module cache dir. Do you think we should do this for explicitly built modules that happen to be in the same directory? What about implicitly built modules that are imported by explicitly built modules?

Cheers,

Ben
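
A minimal sketch of the "relative to the current pcm file" lookup proposed above, assuming C++17 and POSIX-style paths. The helper resolveImportPath is hypothetical and is not a Clang API; it only illustrates why a relative entry survives moving the whole directory.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper (not part of Clang): resolve a path stored in the
// IMPORTS record of `importer`. Absolute paths are used as written;
// relative paths are interpreted against the directory that holds the
// importing .pcm file, so moving that directory moves the imports with it.
fs::path resolveImportPath(const fs::path &importer, const fs::path &stored) {
  if (stored.is_absolute())
    return stored.lexically_normal();
  return (importer.parent_path() / stored).lexically_normal();
}
```

With a stored entry of B.pcm, loading /cache/HASH/A.pcm looks for /cache/HASH/B.pcm, and loading the same file from /moved/HASH/A.pcm looks for /moved/HASH/B.pcm.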

Hey Richard (& cfe-dev),

Currently if one AST file imports another (e.g. module A imports module
B), we store the absolute path of module B inside module A’s IMPORTS
record. When we know that both files will always be in the same directory,
this wastes space and more importantly prevents moving those modules to
another directory. The latter is very handy when debugging a module bug
for which someone has given you their broken module cache.

When an implicitly built module imports another implicitly built module,
we can rely on the modules always being in the same module cache, and I
think we should switch to a relative path that is either looked up relative
to the current pcm file or the (hash-specific) module cache dir. Do you
think we should do this for explicitly built modules that happen to be in
the same directory?

My initial reaction is that we should preserve the path given in the
-fmodule-file= argument on the command line. If I use
-fmodule-file=x/foo.pcm and explicitly build y/bar.pcm, I think that
y/bar.pcm should say that it finds foo in 'x/foo.pcm'.

If the user then builds with -fmodule-file=z/foo.pcm
-fmodule-file=y/bar.pcm, we should probably ignore the path that was
specified for 'foo' when building 'bar'.

What about implicitly built modules that are imported by explicitly built
modules?

It seems tricky to make that work transparently if the modules have been
relocated. We shouldn't expect that explicitly-built modules are located
anywhere near the module cache, so I guess the best we can do is to look
for such files in the module cache by default (even if the module cache has
moved), and not bother writing out /path/to/module/cache/thing.pcm. If
they've been relocated, then I suppose you could explicitly import them
with -fmodule-file=$foo.

However, we need to be cautious that things can change between explicit
module build and use, so we need to use the parameters from the explicit
module itself when determining the configuration hash of the implicit
module. Maybe the simplest thing to do is to skip this case for now; we'd
only be saving the space cost of writing out the path to the module cache,
and I don't think that's a big deal (at least, not compared to the 100K we
waste on a name lookup table for builtins and keywords in each module).

This makes sense to me. In that case, we’ll probably need to store another bit to distinguish “relative to CWD” from “relative to module cache”, or else -fmodule-file=<some implicitly built module>.pcm might choose an unexpected file. Alternatively, we could store the ModuleKind for the module when it was written (as opposed to when it was loaded), I guess.
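
One way the extra bit could look on the reading side is sketched below. The names PathBase, ImportEntry and resolveEntry are invented for illustration and do not reflect the actual AST file format; the sketch assumes C++17.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical encoding of the extra bit: what a relative path is relative to.
enum class PathBase { WorkingDir, ModuleCache };

// Hypothetical import entry; the real IMPORTS record has more fields.
struct ImportEntry {
  std::string moduleName;
  PathBase base;
  fs::path storedPath;
};

// Pick the base directory from the stored bit, then resolve the path.
fs::path resolveEntry(const ImportEntry &entry, const fs::path &cwd,
                      const fs::path &cacheDir) {
  if (entry.storedPath.is_absolute())
    return entry.storedPath;
  const fs::path &root =
      entry.base == PathBase::ModuleCache ? cacheDir : cwd;
  return (root / entry.storedPath).lexically_normal();
}
```

Without the bit, the same stored string B.pcm could resolve to either the working directory or the cache, which is the "unexpected file" risk mentioned above.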

I assume you mean ‘loading bar’.

Good point, I hadn’t considered this issue.

Sounds good.

OT, but: Fixing that has been near the bottom of my TODO list for a long time. IIRC it’s not just a waste of space, because if a system module defines one of those builtin names (e.g. ceil in tgmath.h) we might find the wrong one because we take the first one we find that’s up to date.

Hey Richard (& cfe-dev),

Currently if one AST file imports another (e.g. module A imports module
B), we store the absolute path of module B inside module A’s IMPORTS
record. When we know that both files will always be in the same directory,
this wastes space and more importantly prevents moving those modules to
another directory. The latter is very handy when debugging a module bug
for which someone has given you their broken module cache.

When an implicitly built module imports another implicitly built module,
we can rely on the modules always being in the same module cache, and I
think we should switch to a relative path that is either looked up relative
to the current pcm file or the (hash-specific) module cache dir. Do you
think we should do this for explicitly built modules that happen to be in
the same directory?

My initial reaction is that we should preserve the path given in the
-fmodule-file= argument on the command line. If I use
-fmodule-file=x/foo.pcm and explicitly build y/bar.pcm, I think that
y/bar.pcm should say that it finds foo in 'x/foo.pcm'.

This makes sense to me. In that case, we’ll probably need to store
another bit to distinguish “relative to CWD” from “relative to module
cache”, or else -fmodule-file=<some implicitly built module>.pcm might
choose an unexpected file. Alternatively, we could store the ModuleKind
for the module when it was written (as opposed to when it was loaded), I
guess.

If the user then builds with -fmodule-file=z/foo.pcm
-fmodule-file=y/bar.pcm, we should probably ignore the path that was
specified for 'foo' when building 'bar'.

I assume you mean ‘loading bar’.

Err, I mean we should ignore the path for foo that was specified at the
time when bar was built when loading bar.

What about implicitly built modules that are imported by explicitly built
modules?

It seems tricky to make that work transparently if the modules have been
relocated. We shouldn't expect that explicitly-built modules are located
anywhere near the module cache, so I guess the best we can do is to look
for such files in the module cache by default (even if the module cache has
moved), and not bother writing out /path/to/module/cache/thing.pcm. If
they've been relocated, then I suppose you could explicitly import them
with -fmodule-file=$foo.

However, we need to be cautious that things can change between explicit
module build and use, so we need to use the parameters from the explicit
module itself when determining the configuration hash of the implicit
module.

Good point, I hadn’t considered this issue.

Maybe the simplest thing to do is to skip this case for now; we'd only be
saving the space cost of writing out the path to the module cache,

Sounds good.

and I don't think that's a big deal (at least, not compared to the 100K we
waste on a name lookup table for builtins and keywords in each module).

OT, but: Fixing that has been near the bottom of my TODO list for a long
time. IIRC it’s not just a waste of space, because if a system module
defines one of those builtin names (e.g. ceil in tgmath.h) we might find
the wrong one because we take the first one we find that’s up to date.

I did some analysis of the size cost in the context of PR21397, but never
got any production-ready changes out of it.

Ah.

Actually, can we skip this case? What if the user builds a bunch of modules implicitly then starts using some of them explicitly with -fmodule-file? Then we can’t know at build time whether to write a module-cache-relative path or normal path. That makes me think using cache-relative paths won’t be a great solution.

One answer could be:

  1) When we write a module import, we write out the module’s name.
  2.1) When we load an imported module, we first check if there is an override from -fmodule-file for a module.
  2.2) Otherwise, if the module is imported explicitly, we use a stored path, which will be absolute or relative to the working directory (as normal).
  2.3) Otherwise, if the module is imported implicitly, we lookup the path using the hash-specific module cache and the module’s name.
  3) When we load a module explicitly, we figure out the hash-specific module cache directory from the time it was built (either by reconstructing all the options or by writing it out separately in the AST file and then re-loading it), and use that for any implicit imports of the current module.

Which results in:

a) Any module can be moved around individually by using -fmodule-file
b) Implicit imports of explicit modules will look for their .pcm in the location it was found when the explicit module was built.
c) If there are only implicit modules, you can use -fmodules-cache-path and move the whole cache directory around.

Thoughts? I’m not sure how I feel about (b), but (a) and (c) seem good to me.
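
The lookup order in steps 2.1 to 2.3 can be sketched roughly as follows, assuming C++17. StoredImport and locateImport are hypothetical names, and the cache file name <module>.pcm is a simplifying assumption; the real cache layout may differ.

```cpp
#include <filesystem>
#include <map>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical view of one stored import (step 1: the name is always written).
struct StoredImport {
  std::string moduleName;
  // Present only if the import was explicit when the importer was built.
  std::optional<fs::path> explicitPath;
};

// `overrides` maps module names to paths given via -fmodule-file on the
// current command line. `cacheDirAtBuild` is the hash-specific cache
// directory recovered from the importing module (step 3).
fs::path locateImport(const StoredImport &imp,
                      const std::map<std::string, fs::path> &overrides,
                      const fs::path &cacheDirAtBuild) {
  // 2.1: a -fmodule-file override for this module wins.
  if (auto it = overrides.find(imp.moduleName); it != overrides.end())
    return it->second;
  // 2.2: an explicit import uses the stored path as written.
  if (imp.explicitPath)
    return *imp.explicitPath;
  // 2.3: an implicit import is looked up by name in the module cache.
  // Assumption: the cache file is named "<module>.pcm".
  return cacheDirAtBuild / (imp.moduleName + ".pcm");
}
```

Result (a) falls out of the first branch, and result (c) out of the last one, since only the cache directory changes when every import is implicit.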

Hey Richard (& cfe-dev),

Currently if one AST file imports another (e.g. module A imports module
B), we store the absolute path of module B inside module A’s IMPORTS
record. When we know that both files will always be in the same directory,
this wastes space and more importantly prevents moving those modules to
another directory. The latter is very handy when debugging a module bug
for which someone has given you their broken module cache.

When an implicitly built module imports another implicitly built module,
we can rely on the modules always being in the same module cache, and I
think we should switch to a relative path that is either looked up relative
to the current pcm file or the (hash-specific) module cache dir. Do you
think we should do this for explicitly built modules that happen to be in
the same directory?

My initial reaction is that we should preserve the path given in the
-fmodule-file= argument on the command line. If I use
-fmodule-file=x/foo.pcm and explicitly build y/bar.pcm, I think that
y/bar.pcm should say that it finds foo in 'x/foo.pcm'.

This makes sense to me. In that case, we’ll probably need to store
another bit to distinguish “relative to CWD” from “relative to module
cache”, or else -fmodule-file=<some implicitly built module>.pcm might
choose an unexpected file. Alternatively, we could store the ModuleKind
for the module when it was written (as opposed to when it was loaded), I
guess.

If the user then builds with -fmodule-file=z/foo.pcm
-fmodule-file=y/bar.pcm, we should probably ignore the path that was
specified for 'foo' when building 'bar'.

I assume you mean ‘loading bar’.

Err, I mean we should ignore the path for foo that was specified at the
time when bar was built when loading bar.

Ah.

What about implicitly built modules that are imported by explicitly built
modules?

It seems tricky to make that work transparently if the modules have been
relocated. We shouldn't expect that explicitly-built modules are located
anywhere near the module cache, so I guess the best we can do is to look
for such files in the module cache by default (even if the module cache has
moved), and not bother writing out /path/to/module/cache/thing.pcm. If
they've been relocated, then I suppose you could explicitly import them
with -fmodule-file=$foo.

However, we need to be cautious that things can change between explicit
module build and use, so we need to use the parameters from the explicit
module itself when determining the configuration hash of the implicit
module.

Good point, I hadn’t considered this issue.

Maybe the simplest thing to do is to skip this case for now; we'd only be
saving the space cost of writing out the path to the module cache,

Sounds good.

Actually, can we skip this case? What if the user builds a bunch of
modules implicitly then starts using some of them explicitly with
-fmodule-file? Then we can’t know at build time whether to write a
module-cache-relative path or normal path.

We can't know at the build time of which module? Just to make sure we're on
the same page: whether a module file is explicit or implicit is a property
of how it's loaded, not of how it's built. If it's found by -fmodule-file,
then it's explicit and we should write out its path relative to $PWD; if
it's found in the module cache implicitly, then it's implicit and we should
write out its path relative to the cache.
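
On the writing side, that rule might be sketched as below, assuming C++17 and already-absolute inputs. LoadKind and pathToStore are hypothetical names, not Clang's.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// How the imported module file was found when the importer was built.
enum class LoadKind { Explicit, Implicit };

// Hypothetical writer-side helper: choose the base directory from how the
// import was loaded, then store the path relative to that base.
fs::path pathToStore(LoadKind kind, const fs::path &moduleFile,
                     const fs::path &cwd, const fs::path &cacheDir) {
  const fs::path &base = kind == LoadKind::Explicit ? cwd : cacheDir;
  return moduleFile.lexically_relative(base);
}
```

An explicit /work/x/foo.pcm built from /work is stored as x/foo.pcm, while an implicit /cache/H/B.pcm is stored as B.pcm.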

That makes me think using cache-relative paths won’t be a great solution.

One answer could be:

1) When we write a module import, we write out the module’s name.
2.1) When we load an imported module, we first check if there is an
override from -fmodule-file for a module.
2.2) Otherwise, if the module is imported explicitly, we use a stored
path, which will be absolute or relative to the working directory (as
normal).
2.3) Otherwise, if the module is imported implicitly, we lookup the path
using the hash-specific module cache and the module’s name.
3) When we load a module explicitly, we figure out the hash-specific
module cache directory from the time it was built (either by reconstructing
all the options or by writing it out separately in the AST file and then
re-loading it), and use that for any implicit imports of the current module.

This all makes sense to me.

Which results in:

a) Any module can be moved around individually by using -fmodule-file
b) Implicit imports of explicit modules will look for their .pcm in the
location it was found when the explicit module was built.
c) If there are only implicit modules, you can use -fmodules-cache-path
and move the whole cache directory around.

Thoughts? I’m not sure how I feel about (b), but (a) and (c) seem good to
me.

I'm not really sure what (b) means. But (a) and (c) seem like goodness.

I was basically saying that because the “explicitness” of a module can change in subsequent loads, we cannot store just a cache-relative path for implicit imports unless we can reconstruct where the cache itself was at the time of building the importing module. In other words, unless we have not skipped this case.

What terminology do you use for “built on-demand” vs. “built with -emit-module”? A module that is built on-demand is always being implicitly imported at the point it is originally built.

(b) means if you pass -fmodule-file for A, and A has an implicit import for B, C and D, you cannot move B, C and D together by just changing -fmodules-cache-path. You need to spell out -fmodule-file for each module individually if you want to move them. This is only interesting in that it is different from (c) where if all of the modules are found implicitly we can move the whole cache at once.

If there’s no objections, I’ll put together a patch for this.

Ben

Hey Richard (& cfe-dev),

Currently if one AST file imports another (e.g. module A imports module
B), we store the absolute path of module B inside module A’s IMPORTS
record. When we know that both files will always be in the same directory,
this wastes space and more importantly prevents moving those modules to
another directory. The latter is very handy when debugging a module bug
for which someone has given you their broken module cache.

When an implicitly built module imports another implicitly built
module, we can rely on the modules always being in the same module cache,
and I think we should switch to a relative path that is either looked up
relative to the current pcm file or the (hash-specific) module cache dir.
Do you think we should do this for explicitly built modules that happen to
be in the same directory?

My initial reaction is that we should preserve the path given in the
-fmodule-file= argument on the command line. If I use
-fmodule-file=x/foo.pcm and explicitly build y/bar.pcm, I think that
y/bar.pcm should say that it finds foo in 'x/foo.pcm'.

This makes sense to me. In that case, we’ll probably need to store
another bit to distinguish “relative to CWD” from “relative to module
cache”, or else -fmodule-file=<some implicitly built module>.pcm might
choose an unexpected file. Alternatively, we could store the ModuleKind
for the module when it was written (as opposed to when it was loaded), I
guess.

If the user then builds with -fmodule-file=z/foo.pcm
-fmodule-file=y/bar.pcm, we should probably ignore the path that was
specified for 'foo' when building 'bar'.

I assume you mean ‘loading bar’.

Err, I mean we should ignore the path for foo that was specified at the
time when bar was built when loading bar.

Ah.

What about implicitly built modules that are imported by explicitly built
modules?

It seems tricky to make that work transparently if the modules have been
relocated. We shouldn't expect that explicitly-built modules are located
anywhere near the module cache, so I guess the best we can do is to look
for such files in the module cache by default (even if the module cache has
moved), and not bother writing out /path/to/module/cache/thing.pcm. If
they've been relocated, then I suppose you could explicitly import them
with -fmodule-file=$foo.

However, we need to be cautious that things can change between explicit
module build and use, so we need to use the parameters from the explicit
module itself when determining the configuration hash of the implicit
module.

Good point, I hadn’t considered this issue.

Maybe the simplest thing to do is to skip this case for now; we'd only
be saving the space cost of writing out the path to the module cache,

Sounds good.

Actually, can we skip this case? What if the user builds a bunch of
modules implicitly then starts using some of them explicitly with
-fmodule-file? Then we can’t know at build time whether to write a
module-cache-relative path or normal path.

We can't know at the build time of which module?

I was basically saying that because the “explicitness” of a module can
change in subsequent loads, we cannot store just a cache-relative path for
implicit imports unless we can reconstruct where the cache itself was at
the time of building the importing module. In other words, unless we have
not skipped this case.

Just to make sure we're on the same page: whether a module file is
explicit or implicit is a property of how it's loaded, not of how it's
built.

What terminology do you use for “built on-demand” vs. “built with
-emit-module”? A module that is built on-demand is always being implicitly
imported at the point it is originally built.

I'm not sure to what extent this is a useful distinction; a "built
on-demand" module is built as if by running another clang instance with
"-emit-module". If you use "-emit-module" without "-o", the .pcm file gets
put into the module cache and is indistinguishable from an implicitly-built
module. (But there's some terminology in case we need to talk about this:
"explicitly- versus implicitly-built module".)

If it's found by -fmodule-file, then it's explicit and we should write out
its path relative to $PWD; if it's found in the module cache implicitly,
then it's implicit and we should write out its path relative to the cache.

That makes me think using cache-relative paths won’t be a great solution.

One answer could be:

1) When we write a module import, we write out the module’s name.
2.1) When we load an imported module, we first check if there is an
override from -fmodule-file for a module.
2.2) Otherwise, if the module is imported explicitly, we use a stored
path, which will be absolute or relative to the working directory (as
normal).
2.3) Otherwise, if the module is imported implicitly, we lookup the path
using the hash-specific module cache and the module’s name.
3) When we load a module explicitly, we figure out the hash-specific
module cache directory from the time it was built (either by reconstructing
all the options or by writing it out separately in the AST file and then
re-loading it), and use that for any implicit imports of the current module.

This all makes sense to me.

Which results in:

a) Any module can be moved around individually by using -fmodule-file
b) Implicit imports of explicit modules will look for their .pcm in the
location it was found when the explicit module was built.
c) If there are only implicit modules, you can use -fmodules-cache-path
and move the whole cache directory around.

Thoughts? I’m not sure how I feel about (b), but (a) and (c) seem good
to me.

I'm not really sure what (b) means. But (a) and (c) seem like goodness.

(b) means if you pass -fmodule-file for A, and A has an implicit import
for B, C and D, you cannot move B, C and D together by just changing
-fmodules-cache-path. You need to spell out -fmodule-file for each module
individually if you want to move them. This is only interesting in that it
is different from (c) where if *all* of the modules are found implicitly we
can move the whole cache at once.

I see. There are more general issues with this situation; if we explicitly
import module file A, and module file A implicitly imports module file B
from the module cache, and B is no longer in the cache for whatever reason,
then we have essentially no way to recover; we don't have enough
information to rebuild B, and we can't and shouldn't rebuild A. (And even
if we could rebuild B, we would need a bit-for-bit identical version in
order for A to be able to use it.) So I don't think we need to worry too
much about problems that are specific to this case.

If there’s no objections, I’ll put together a patch for this.

Sounds good to me!