RFC: A virtual file system for clang

Hi all,

I've been hacking on a virtual file system for clang and this seemed like
the right time to start getting some feedback. Briefly, the idea is to
interpose a virtual file system layer between llvm::sys::fs and Clang's
FileManager that allows us to mix virtual files/links/etc. with the 'real'
file system in a general way.

Motivation

The use case that I have in mind is to allow a build system to provide a
file/directory layout to clang without having to construct it "for real" on
disk. For example, I am building a project containing two modules, and
module A imports module B. It would be useful if we could bundle up the
headers and module.map file for module B from wherever they may exist in
the source directories and provide clang with a notion of the file layout
of B _as it will be installed_. Right now, I know of two existing ways to
accomplish this:

1) Copy the files into a fake installation during build. This is
unsatisfying, as it requires tracking and copying files every time they are
changed. And diagnostics, debug info, etc. do not refer back to the
original source file.

2) Header maps provide this functionality for header files. However,
header maps work from within the header search logic, which does not extend
well to other kinds of files. They are also insufficient for bundling
modules, as clang needs to see the framework for the module laid out as
described in the module map.

Description

The idea is to abstract the view of the file system using an
AbstractFileSystem class that mimics the llvm::sys::fs interface:

class AbstractFileSystem {
public:
  class Status { ... };
  // openFileForRead
  // status, and maybe 'stat'
  // getBuffer
  // getBufferForOpenFile
  // recursive directory iteration
};

that can be implemented by any concrete file system that we want. Clients
that want to look up files/directories (notably the FileManager) can operate
on an AbstractFileSystem object. One leaky part of this interface is that
clients that need to care whether they are working with a 'real path' will
need to explicitly ask for it. For example, debug information and
diagnostics should ask for the real path. I suggest putting that
information into the AbstractFileSystem::Status object.
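
As a rough illustration of the "real path in Status" idea, here is a minimal sketch. It uses std::string and std::error_code in place of LLVM's Twine and error_code, and the names RemappingFileSystem, RequestedPath, RealPath and pathForDiagnostics are hypothetical, not part of the proposal.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <system_error>

// Hypothetical status record: besides the usual metadata, it remembers
// which on-disk path actually backs the (possibly virtual) path.
struct Status {
  std::string RequestedPath; // path the client asked for
  std::string RealPath;      // path to report in diagnostics/debug info
  bool IsDirectory = false;
};

// Simplified stand-in for the proposed read-only interface.
class AbstractFileSystem {
public:
  virtual ~AbstractFileSystem() {}
  // Fills Result and returns a success (default-constructed) error_code,
  // or returns an error and leaves Result untouched.
  virtual std::error_code status(const std::string &Path, Status &Result) = 0;
};

// Toy file system that maps virtual paths onto real ones (assumption:
// a plain path-to-path table is enough for the illustration).
class RemappingFileSystem : public AbstractFileSystem {
  std::map<std::string, std::string> VirtualToReal;

public:
  void addMapping(const std::string &Virtual, const std::string &Real) {
    VirtualToReal[Virtual] = Real;
  }

  std::error_code status(const std::string &Path, Status &Result) override {
    std::map<std::string, std::string>::const_iterator I =
        VirtualToReal.find(Path);
    if (I == VirtualToReal.end())
      return std::make_error_code(std::errc::no_such_file_or_directory);
    Result.RequestedPath = Path;
    Result.RealPath = I->second;
    Result.IsDirectory = false;
    return std::error_code();
  }
};

// A diagnostics client asks for the real path explicitly.
std::string pathForDiagnostics(AbstractFileSystem &FS,
                               const std::string &Path) {
  Status S;
  if (FS.status(Path, S))
    return Path; // lookup failed: fall back to the path as written
  return S.RealPath;
}
```

With a mapping from /install/B/B.h to /src/libB/include/B.h, pathForDiagnostics would report the source-tree path, which is the behaviour the copy-based approach in 1) cannot give.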

Some non-goals (at least for a first iteration):
1) File system modification operations (create_directory, rename, etc.).
Clients will continue to use the real file system for these operations,
and we don't intend to detect any conflicts this might create.
2) Completely virtual file buffers that do not exist on disk.

One implementation of the AbstractFileSystem interface would be a wrapper
over the 'real' file system, which would just defer to llvm::sys::fs.

class RealFileSystem : public AbstractFileSystem { ... };

And to provide a unified view of the file system, we can create an overlay
file system, similar to [1].

class OverlayFileSystem : public AbstractFileSystem { ... };
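
A minimal sketch of how such an overlay could resolve lookups, assuming a simple "topmost layer wins" rule. The in-memory layer is only a stand-in so the example is self-contained (fully virtual buffers are a non-goal above); pushOverlay and InMemoryFileSystem are hypothetical names, and std types replace the LLVM ones.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

// Simplified stand-in for the proposed interface.
class AbstractFileSystem {
public:
  virtual ~AbstractFileSystem() {}
  virtual std::error_code getBuffer(const std::string &Path,
                                    std::string &Result) = 0;
};

// Stand-in for a concrete layer; a real layer would defer to the disk.
class InMemoryFileSystem : public AbstractFileSystem {
  std::map<std::string, std::string> Files;

public:
  void addFile(const std::string &Path, const std::string &Contents) {
    Files[Path] = Contents;
  }
  std::error_code getBuffer(const std::string &Path,
                            std::string &Result) override {
    std::map<std::string, std::string>::const_iterator I = Files.find(Path);
    if (I == Files.end())
      return std::make_error_code(std::errc::no_such_file_or_directory);
    Result = I->second;
    return std::error_code();
  }
};

// Overlay: layers are stored bottom first and searched top first.
class OverlayFileSystem : public AbstractFileSystem {
  std::vector<std::shared_ptr<AbstractFileSystem>> Layers;

public:
  void pushOverlay(std::shared_ptr<AbstractFileSystem> FS) {
    Layers.push_back(std::move(FS));
  }

  std::error_code getBuffer(const std::string &Path,
                            std::string &Result) override {
    for (auto I = Layers.rbegin(), E = Layers.rend(); I != E; ++I) {
      std::error_code EC = (*I)->getBuffer(Path, Result);
      // Only "not found" falls through to the next layer down; success
      // and any other error are reported as-is.
      if (EC != std::errc::no_such_file_or_directory)
        return EC;
    }
    return std::make_error_code(std::errc::no_such_file_or_directory);
  }
};
```

Whether errors other than "not found" should also fall through to lower layers is a design choice this sketch does not try to settle.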

This reminds me a lot of Plan 9. Plan 9 may be a good inspiration for the
primitive API operations (see also the 9p protocol).

-- Sean Silva

> The idea is to abstract the view of the file system using an AbstractFileSystem class that mimics the llvm::sys::fs interface:

My only request is to add functionality as you go. llvm::sys::fs initially had a lot more functionality than what was used by llvm or clang, and I am sure there is still some left.

Please also use the opportunity to modernize the API when possible (using ErrorOr for example).

Cheers,
Rafael

Hi Rafael,

Sorry I missed this earlier,

>> The idea is to abstract the view of the file system using an AbstractFileSystem class that mimics the llvm::sys::fs interface:

> My only request is to add functionality as you go. llvm::sys::fs initially had a lot more functionality than what was used by llvm or clang, and I am sure there is still some left.

Makes sense. It would be really easy to go nuts and add way more than we’ll ever need.

> Please also use the opportunity to modernize the API when possible (using ErrorOr for example).

I think keeping the interface as close to llvm::sys::fs as possible is really helpful for staging this in. It makes it easier to see the correspondence between the real file system and the virtual one, and it makes updating clients easier. If the intention is to modernize llvm::sys::fs as well, then I feel that’s orthogonal to what I want to accomplish. Is the existing interface actually a pain point? It seems pretty easy to use to me.

Ben

>> Please also use the opportunity to modernize the API when possible (using ErrorOr for example).

> I think keeping the interface as close to llvm::sys::fs as possible is really helpful for staging this in. It makes it easier to see the correspondence between the real file system and the virtual one, and it makes updating clients easier. If the intention is to modernize llvm::sys::fs as well, then I feel that’s orthogonal to what I want to accomplish. Is the existing interface actually a pain point? It seems pretty easy to use to me.

Well, it means more code to change when we do fix llvm::sys::fs :(

The main issue with returning error_code is that it is really easy to
ignore the error (like with plain POSIX functions). With ErrorOr we
get an assert if someone tries to use a result in an error condition.
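
To make the "assert on misuse" point concrete, here is a simplified stand-in for an ErrorOr-style wrapper. It is not LLVM's actual class; SimpleErrorOr and lookupSize are hypothetical names, and std::error_code replaces llvm::error_code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <system_error>

// Simplified stand-in for an ErrorOr-style wrapper (not LLVM's class):
// it holds either an error or a value, and asserts if the value is
// read while an error is stored.
template <typename T> class SimpleErrorOr {
  std::error_code EC;
  T Value;

public:
  SimpleErrorOr(std::error_code E) : EC(E), Value() {
    assert(EC && "construct from a value to signal success");
  }
  SimpleErrorOr(const T &V) : EC(), Value(V) {}

  explicit operator bool() const { return !EC; } // true on success
  std::error_code getError() const { return EC; }
  const T &get() const {
    assert(!EC && "tried to use the result of a failed operation");
    return Value;
  }
};

// Hypothetical lookup that reports failure through the return type, so
// the caller cannot reach the value without going through the check.
SimpleErrorOr<unsigned long>
lookupSize(const std::map<std::string, unsigned long> &Sizes,
           const std::string &Path) {
  std::map<std::string, unsigned long>::const_iterator I = Sizes.find(Path);
  if (I == Sizes.end())
    return std::make_error_code(std::errc::no_such_file_or_directory);
  return I->second;
}
```

With the error_code plus out-parameter style, a caller that forgets to check the return value silently reads a stale or uninitialized result. Here the same mistake trips the assert in get() in a debug build.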

Cheers,
Rafael

What would you do with the predicate functions? Most (all?) are overloaded like so:

error_code predicate(const Twine &Path, bool &Result);
// overload for clients that don’t differentiate between negative results and errors
bool predicate(const Twine &Path) {
  bool Result;
  return !predicate(Path, Result) && Result;
}

This prevents creating ErrorOr<bool> predicate(const Twine& Path), which results in ambiguity.

Ben

It probably depends on the predicate. The first thing to check is whether both versions are needed. If there really is a use for both, we would need different names, or we would make predicates an exception to the rule.

I don't remember the API well enough to have an opinion right now. I can take a look when I get home.

Let's leave the predicates as is for now then.

Btw, how do you guys plan to expose this via libclang? Have it pass a mapping or actually allow it to provide function pointers?

Ben

Cheers,
Rafael

Current plan is to expose functions to get the VFS description file,
which Clang can then parse to build the VFS accordingly.

Nevertheless, I think we can provide both interfaces, if we freeze the
'Status' data structure in libclang, or if we declare that those
functions are exempt from libclang ABI stability. But in that case
one can probably just use the C++ API directly, so I don't see the
function-based interface as important.

Dmitri

OK, that is a relief :)

I was truly afraid we would end up setting part of the current file
API in stone because of the libclang ABI stability.

I took a quick look at the API last night. An observation is that 99%
of the users want the simplest version of any API (exists returning
false on error, create_directory that hides whether the directory already
existed, etc.). Unfortunately, quite a few end up using the more
complex APIs.

So what I would suggest for the virtual fs is to start with only the
simple ones. If you do have a use that needs the more general case,
adding that predicate with a longer name is probably a good thing so
that it doesn't get used accidentally.
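
One way to read this suggestion, sketched under the assumption that the general form keeps the error_code plus out-parameter shape: the short name goes to the simple predicate, and the general one gets a longer name. ToyFileSystem, markUnreadable and existsWithError are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <system_error>

// Toy file system used only to show the naming scheme.
class ToyFileSystem {
  std::set<std::string> Files;
  std::set<std::string> Unreadable; // paths whose lookup fails outright

public:
  void addFile(const std::string &Path) { Files.insert(Path); }
  void markUnreadable(const std::string &Path) { Unreadable.insert(Path); }

  // General form: distinguishes "does not exist" from "could not tell".
  // The longer name makes it harder to pick up by accident.
  std::error_code existsWithError(const std::string &Path,
                                  bool &Result) const {
    if (Unreadable.count(Path))
      return std::make_error_code(std::errc::permission_denied);
    Result = Files.count(Path) != 0;
    return std::error_code();
  }

  // Simple form: an error is treated the same as "does not exist".
  bool exists(const std::string &Path) const {
    bool Result = false;
    return !existsWithError(Path, Result) && Result;
  }
};
```

Because the two functions have different names, the overload ambiguity raised earlier for an ErrorOr<bool> return type does not come up.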

Is that OK?

Cheers,
Rafael

There seem to be 2 interfaces here: one user interface, and one for
subclassing.

Are you fine with the interface for subclassing being the current
Status-based proposal?

> There seem to be 2 interfaces here: one user interface, and one for
> subclassing.
>
> Are you fine with the interface for subclassing being the current
> Status-based proposal?

The set of virtual functions currently is

virtual llvm::error_code status(const llvm::Twine &Path, Status &Result) = 0;
virtual llvm::error_code openFileForRead(const llvm::Twine &Path,
                                           FileDescriptor &ResultFD) = 0;
virtual llvm::error_code getBufferForFile(
      const llvm::Twine &Name, llvm::OwningPtr<llvm::MemoryBuffer> &Result,
      int64_t FileSize = -1, bool RequiresNullTerminator = true) = 0;
virtual llvm::error_code statusOfOpenFile(FileDescriptor FD, Status &Result);
virtual llvm::error_code
  getBufferForOpenFile(FileDescriptor FD, const llvm::Twine &Name,
                       llvm::OwningPtr<llvm::MemoryBuffer> &Result,
                       int64_t FileSize = -1,
                       bool RequiresNullTerminator = true);

I would probably
* Use ErrorOr, so ErrorOr<Status> statusOfOpenFile(...);
* Remove getBufferForFile. It can just be implemented with
  getBufferForOpenFile, no?
  Btw, getBufferForOpenFile is a case where returning error_code is
  probably the best we can do right now. ErrorOr<OwningPtr<..> >
  effectively requires move semantics.
* Can status be implemented with openFileForRead + statusOfOpenFile?

I would still be uncomfortable declaring Status ABI-stable, but this
is a fine small set of APIs to have.

Cheers,
Rafael

In this case, +1 for those ideas...

> I would probably
> * Use ErrorOr, so ErrorOr<Status> statusOfOpenFile(…);

Okay.

> * Remove getBufferForFile. It can just be implemented with
> getBufferForOpenFile, no?

By remove, do you mean make it non-virtual? If so, I’m fine with that. I don’t want to remove it, since the path-based getBuffer functions are commonly used.

> btw, getBufferForOpenFile is a case where returning error_code is
> probably the best we can do right now. ErrorOr<OwningPtr<..> >
> effectively requires move semantics.
> * Can status be implemented with openFileForRead + statusOfOpenFile?

Status has to work on directories and other things we cannot open. You can go the other way and just have path-based APIs, but the FileManager uses open-file operations because it should be faster to open and then stat on a file descriptor than it is to stat a path and then open that path. I think we really want to keep both versions.

>> * Remove getBufferForFile. It can just be implemented with
>> getBufferForOpenFile, no?

> By remove, do you mean make it non-virtual? If so, I’m fine with that. I don’t want to remove it, since the path-based getBuffer functions are commonly used.

Yes, sorry. Just make it non-virtual. It is a useful convenience.
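
A sketch of what "non-virtual convenience" could look like: subclasses implement only the two open-file primitives, and the path-based getBufferForFile is written once in the base class. The types are simplified stand-ins (std::string for Twine and MemoryBuffer, an int for FileDescriptor), and ToyFileSystem is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <system_error>
#include <vector>

typedef int FileDescriptor; // stand-in for the proposal's descriptor type

class AbstractFileSystem {
public:
  virtual ~AbstractFileSystem() {}

  // Primitives each concrete file system has to provide.
  virtual std::error_code openFileForRead(const std::string &Path,
                                          FileDescriptor &ResultFD) = 0;
  virtual std::error_code getBufferForOpenFile(FileDescriptor FD,
                                               std::string &Result) = 0;

  // Non-virtual convenience built from the primitives. Closing the
  // descriptor is left out to keep the sketch short.
  std::error_code getBufferForFile(const std::string &Path,
                                   std::string &Result) {
    FileDescriptor FD = -1;
    if (std::error_code EC = openFileForRead(Path, FD))
      return EC;
    return getBufferForOpenFile(FD, Result);
  }
};

// Toy implementation: a descriptor is an index into OpenFiles.
class ToyFileSystem : public AbstractFileSystem {
  std::map<std::string, std::string> Files;
  std::vector<std::string> OpenFiles;

public:
  void addFile(const std::string &Path, const std::string &Contents) {
    Files[Path] = Contents;
  }

  std::error_code openFileForRead(const std::string &Path,
                                  FileDescriptor &ResultFD) override {
    std::map<std::string, std::string>::const_iterator I = Files.find(Path);
    if (I == Files.end())
      return std::make_error_code(std::errc::no_such_file_or_directory);
    OpenFiles.push_back(I->second);
    ResultFD = static_cast<FileDescriptor>(OpenFiles.size() - 1);
    return std::error_code();
  }

  std::error_code getBufferForOpenFile(FileDescriptor FD,
                                       std::string &Result) override {
    if (FD < 0 || static_cast<std::size_t>(FD) >= OpenFiles.size())
      return std::make_error_code(std::errc::bad_file_descriptor);
    Result = OpenFiles[FD];
    return std::error_code();
  }
};
```

Callers keep the familiar path-based call, while a new file system only has to get the open-file primitives right.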

>> btw, getBufferForOpenFile is a case where returning error_code is
>> probably the best we can do right now. ErrorOr<OwningPtr<..> >
>> effectively requires move semantics.
>> * Can status be implemented with openFileForRead + statusOfOpenFile?

> Status has to work on directories and other things we cannot open. You can go the other way and just have path-based APIs, but the FileManager uses open-file operations because it should be faster to open and then stat on a file descriptor than it is to stat a path and then open that path. I think we really want to keep both versions.

Ah, OK. I forgot about directories. Yes, let's keep both.

Cheers,
Rafael

I agree, and then some. The pattern we have followed and IMO should
continue following is to have an explicit mapping to a stable enum in the C
interface. It is much easier to remember that you only need to consider ABI
fallout if you're changing the C interfaces.
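
A small sketch of that pattern, with made-up names (FileType, SketchFileKind and toStableKind are hypothetical and not taken from libclang): the internal C++ enum can change freely, while the C-facing enum has pinned values and is reached only through an explicit mapping.

```cpp
#include <cassert>

// Internal C++ enum: free to be reordered or extended at any time.
enum class FileType { Directory, Regular, Symlink, Unknown };

// Hypothetical C-facing enum. The numeric values are pinned and would
// never be reused, so they can be part of a stable ABI.
extern "C" {
enum SketchFileKind {
  SketchFileKind_Unknown = 0,
  SketchFileKind_Regular = 1,
  SketchFileKind_Directory = 2,
  SketchFileKind_Symlink = 3
};
}

// The only place where the two enums meet. Reordering FileType cannot
// silently change the values seen by C clients.
SketchFileKind toStableKind(FileType T) {
  switch (T) {
  case FileType::Regular:
    return SketchFileKind_Regular;
  case FileType::Directory:
    return SketchFileKind_Directory;
  case FileType::Symlink:
    return SketchFileKind_Symlink;
  case FileType::Unknown:
    break;
  }
  return SketchFileKind_Unknown;
}
```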

What are the pros of OverlayFileSystem inside clang compared to unionFS (or aufs) of a RO directory (possibly maintained by an external caching daemon) with the state of the repository and a RW directory with the changes the user is working on?

Thanks,

Maurizio

> What are the pros of OverlayFileSystem inside clang compared to unionFS
> (or aufs) of a RO directory (possibly maintained by an external caching
> daemon) with the state of the repository and a RW directory with the
> changes the user is working on?

I'm not sure I understand what you're getting at completely. Can you
elaborate or provide links to what you propose?

I was not really proposing anything, just trying to understand the benefit of a virtual file system inside clang as opposed to an unmodified compiler + something like http://mike-bland.com/2012/10/01/tools.html#tools-blaze-forge-srcfs-objfs, that you’re probably familiar with :)

> I was not really proposing anything, just trying to understand the benefit
> of a virtual file system inside clang as opposed to an unmodified compiler
> + something like
> http://mike-bland.com/2012/10/01/tools.html#tools-blaze-forge-srcfs-objfs,
> that you're probably familiar with :)

That's easy - if you want to run clang inside something that doesn't have
FUSE, say, an IDE, or a MapReduce ;), and you want to run clang in parallel
on many files, it's much simpler not to have to set up anything on disk
(even on a RAM disk), including potentially huge directory structures.

And on the other end, some things are just not possible to do well with a
FUSE-based solution - for example, intercepting all file reads in order to
record every file involved in the compilation, so that you can later
replay it exactly the same way it was (for added benefit, you can do
cross-platform record/replay runs).
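
The record/replay idea can be sketched as a wrapper that forwards every read and remembers what it saw. All names are hypothetical, the interface is a simplified stand-in for the proposed one, and the in-memory file system serves as both the fake "disk" and the replay target to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <system_error>

// Simplified stand-in for the proposed interface.
class AbstractFileSystem {
public:
  virtual ~AbstractFileSystem() {}
  virtual std::error_code getBuffer(const std::string &Path,
                                    std::string &Result) = 0;
};

class InMemoryFileSystem : public AbstractFileSystem {
  std::map<std::string, std::string> Files;

public:
  InMemoryFileSystem() {}
  explicit InMemoryFileSystem(const std::map<std::string, std::string> &F)
      : Files(F) {}
  void addFile(const std::string &Path, const std::string &Contents) {
    Files[Path] = Contents;
  }
  std::error_code getBuffer(const std::string &Path,
                            std::string &Result) override {
    std::map<std::string, std::string>::const_iterator I = Files.find(Path);
    if (I == Files.end())
      return std::make_error_code(std::errc::no_such_file_or_directory);
    Result = I->second;
    return std::error_code();
  }
};

// Forwards each read to the wrapped file system and keeps a copy of
// every buffer that was read successfully.
class RecordingFileSystem : public AbstractFileSystem {
  AbstractFileSystem &Underlying;
  std::map<std::string, std::string> Recorded;

public:
  explicit RecordingFileSystem(AbstractFileSystem &FS) : Underlying(FS) {}

  std::error_code getBuffer(const std::string &Path,
                            std::string &Result) override {
    std::error_code EC = Underlying.getBuffer(Path, Result);
    if (!EC)
      Recorded[Path] = Result;
    return EC;
  }

  const std::map<std::string, std::string> &recorded() const {
    return Recorded;
  }
};
```

Feeding recorded() into a fresh InMemoryFileSystem gives a replay file system that serves exactly the files the first run touched, and nothing else.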

Cheers,
/Manuel