Thanks for the detailed response Sam!
This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
- the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn’t that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there’s a refactoring that could fix it.
Here are the problems I hit that led to this design:
- Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText
Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)
I couldn’t think of a good way to solve this besides passing in configuration at output creation time.
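For what it’s worth, here’s a minimal sketch of what “configuration at output creation time” could look like. The struct fields mirror the flags named above, but the shapes (and the effectiveConfig helper) are illustrative, not the patch’s actual API:

```cpp
#include <utility>

// Hypothetical sketch: per-output settings travel with the creation call
// rather than living only on the backend.
struct OnDiskOutputConfig {
  bool OpenFlagText = false;      // open in text mode (e.g., Windows CRLF)
  bool CreateDirectories = true;  // create missing parent directories
  bool RemoveFileOnSignal = true; // clean up temporaries on a crash
};

struct OnDiskOutputBackend {
  // Backend-wide defaults; a caller can override them per output.
  OnDiskOutputConfig Defaults;

  // Each output gets its own effective config at creation time.
  OnDiskOutputConfig effectiveConfig(const OnDiskOutputConfig *Override) const {
    return Override ? *Override : Defaults;
  }
};
```

The point is just that output-specific properties (text mode, directory creation) stay per-call, while backend-wide policy (signal cleanup) can stay on the backend as a default.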
- Some “configurers” may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure the internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(other flags might benefit)
This is “documented” by the call chains of clang::CompilerInstance::createOutputFile.
I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any “contained” OnDiskOutputBackend. This seems pretty heavyweight though.
- OutputDestination has a lot of stuff in it; I’ll need to dig further into the patch to try to understand why
Firstly, Output is the user-facing abstraction; OutputDestination has the low-level bits. Another reasonable name might be OutputImpl, but OutputDestination seemed more descriptive.
Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious “interesting” case is in-memory + on-disk (in either order).
- Make sub-classes of OutputDestination as small as possible given the above.
Thirdly, there’s some boilerplate related to lifetimes of multiple destinations. Probably having an explicit MirroringOutputDestination would be better.
As for OutputManager itself, I think this belongs in clang, if it’s needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it’s not obvious that those sorts of concerns will generalize across tools or that there’s much to be gained from sharing code here.
In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as managing the intermediate content buffer.
Probably it’s better to have the OutputBackend return an Output directly (and get rid of the OutputManager).
(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn’t need to exist.)
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It’s tempting to say “not threadsafe, you should lock”, but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.
Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn’t get around to posting it until this AM):
https://reviews.llvm.org/D95628
My intuition is:
Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination
Thread-unsafe:
- Using a single Output/OutputDestination concurrently
This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.
Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockFileManager to be skipped. This is not really about outputs — it’s inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.
Relatedly, working-directory/relative-path handling should be considered.
Yeah, you’re probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.
Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD, concurrent access should be fine.
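A rough sketch of that proxy idea, assuming a resolve/changeDirectory shape that isn’t in the patch (only the ChangeDirectoryOutputBackend name comes from the suggestion above):

```cpp
#include <string>
#include <utility>

// Illustrative: the backend captures an immutable working directory at
// construction and prepends it to relative output paths.
class ChangeDirectoryOutputBackend {
  std::string WorkingDir; // fixed for the lifetime of this backend

public:
  explicit ChangeDirectoryOutputBackend(std::string CWD)
      : WorkingDir(std::move(CWD)) {}

  // Absolute paths pass through; relative paths get the captured CWD.
  std::string resolve(const std::string &Path) const {
    if (!Path.empty() && Path.front() == '/')
      return Path;
    return WorkingDir + "/" + Path;
  }

  // "Changing directory" creates a new backend instead of mutating this
  // one, so concurrent users never observe a racing CWD.
  ChangeDirectoryOutputBackend changeDirectory(const std::string &Dir) const {
    return ChangeDirectoryOutputBackend(resolve(Dir));
  }
};
```

Since the CWD is immutable per backend, the thread-safety story above (“backends are thread-safe”) is unaffected.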
Two other thoughts related to paths:
- I wonder if this abstraction treats the “path” as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)
I think it can. One approach is to use proxy backends:
- Inner layer writes to ‘-’ / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., /name1 and /name2.)
- Outer layer passes the middle layer a proxy backend that “does the right thing” with the tree. (E.g., writes to /Users/dexonsmith/name{1,2}.)
=> If writing on-disk, “the right thing” is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git’s CAS, “the right thing” is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call “git-mktree” when the middle layer is “done” (maybe a callback in the backend-passed-to-the-middle-layer’s destructor).
(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)
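The layering above could be sketched as composable path-remapping proxies; everything here (names, the std::function shape) is purely illustrative:

```cpp
#include <functional>
#include <string>

// Each layer wraps the next with a path-remapping function; only the
// outermost layer knows the final on-disk (or CAS) location.
using PathMap = std::function<std::string(const std::string &)>;

// Middle layer: maps the inner layer's "-" (stdout) to a blob name.
PathMap mapStdoutTo(const std::string &Name, PathMap Next) {
  return [=](const std::string &Path) {
    return Next(Path == "-" ? Name : Path);
  };
}

// Outer layer: prepends the tree's output directory.
PathMap prependDir(const std::string &Dir) {
  return [=](const std::string &Path) { return Dir + Path; };
}
```

E.g., composing `mapStdoutTo("/name1", prependDir("/Users/dexonsmith"))` sends the inner layer’s “-” to “/Users/dexonsmith/name1” without any layer below the outermost knowing the destination.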
I think it’s doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is Output::getPath() an abstraction leak?
- Should we have a createOutput that doesn’t take a path?
- …
Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps “side” files (such as AST record layouts into “.ast-dump” or textual IR into “.ll”). This avoids needing to know the on-disk file path (“/path/to/output” => “/path/to/output.ll”), or to even know whether there’s a disk.
- How should we virtualize stdout / stderr?
- “‘-’ means stdout” is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I’m not sure what to do with stderr. No one ever “closes” this stream.
- Are there other outputs that don’t have path names?
- Do we need to virtualize llvm::sys::fs::create_directories?
(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
Why doesn’t this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It’s easy enough to join them, when useful: e.g., write to an InMemoryFileSystem (via an InMemoryOutputBackend) and install this same FS in the FileManager (maybe indirectly via an OverlayFileSystem).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else.
But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.
If you’re going to require the FileSystem and OutputBackend to be linked, then I think they should be the same object.
No, I don’t think that should be a requirement / expectation. It’s a specific requirement for Clang’s implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.
But if it’s mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.
Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang’s job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.
Note: implicit modules doesn’t currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It’s pretty awkward though.
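For concreteness, a directly virtualized module cache along the put(module, key) + get(key) lines suggested above might look something like this; SimpleModuleCache is a stand-in for illustration, not Clang’s actual InMemoryModuleCache:

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical interface: the compiler stores and retrieves built modules
// by key, with no assumption that the cache is backed by a filesystem.
class ModuleCache {
public:
  virtual ~ModuleCache() = default;
  virtual void put(const std::string &Key, std::string PCM) = 0;
  virtual std::optional<std::string> get(const std::string &Key) const = 0;
};

// Trivial in-memory implementation, just to show the contract.
class SimpleModuleCache final : public ModuleCache {
  std::map<std::string, std::string> Modules;

public:
  void put(const std::string &Key, std::string PCM) override {
    Modules[Key] = std::move(PCM);
  }
  std::optional<std::string> get(const std::string &Key) const override {
    auto I = Modules.find(Key);
    if (I == Modules.end())
      return std::nullopt;
    return I->second;
  }
};
```

With an interface like this, the “writes must be readable back” coupling lives entirely inside the cache implementation, and the OutputBackend / input FileSystem can stay independent.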