RFC: A builder for CompilerInstance

I have the beginnings of a patch for a CompilerInstance builder. I'd like some early feedback before going further.

## Questions ##

- Is a builder API a good idea?
- Should it look something like this? If not, what?
- What should be in it?
- How should it be staged?

## Motivation ##

My motivation for looking at this is a little tangential: while working on the InMemoryModuleCache recently (e.g., r355778) I realized that this cache should be leveraging the FileManager to dedup paths by inode. However when I tried a naïve fix I was getting crashes, and I had trouble understanding FileManager/VFS lifetime and ownership. Ultimately, I'm hoping to clarify what the invariants are around an instance of CompilerInstance and its FileManager.

## Is a builder API a good idea? ##

A builder API seems a little cleaner than how we currently manage the boilerplate in constructing a CompilerInstance.

The status quo is that each client default-constructs and then calls various mutators like `setInstance` to set it up. However, this API allows clients to swap in different data structures after construction. If clients don't need or use that flexibility, the code for maintaining that flexibility will be brittle, some combination of broken and confusing.

For example, I cleaned up some confusing logic in r357037 in CompilerInstance::setFileManager and CompilerInstance::createFileManager that were trying to adapt to a CompilerInstance's VFS and FileManager changing at arbitrary times and keep them in sync (in that case, it was easy: just use FileManager's VFS directly).

With a builder API, we can lock down CompilerInstance mutators so that it's clear what is actually allowed to change mid-flight and what isn't. It's probably also easier to handle "obvious defaults" in a sane way for clients that only want to specify one or two things (e.g., unit tests).

## Should it look something like this? If not, what? ##

Here's the WIP patch I have:
https://reviews.llvm.org/differential/diff/192381/
(also attached below, but without the -U9999999 context)

The core idea is to remove `CompilerInstance::set*` and `CompilerInstance::create*` in cases where a CompilerInstance never needs more than one version of a data structure over its lifetime. So far the patch removes CompilerInstance::setInvocation, CompilerInstance::setDiagnostics, and CompilerInstance::createDiagnostics.

The meat is in two files:
- CompilerInstanceBuilder.h (⚙ Diff View): a move-only data structure for collecting the parameters for building a CompilerInstance.
- CompilerInstance.cpp (⚙ Diff View): logic to process those parameters at CompilerInstance construction time.

Note on one semantic change I made: CompilerInstance *always* has a DiagnosticsEngine now. I did that because there were tons of sites calling CompilerInstance::createDiagnostics with no arguments. Originally I had a zero-parameter version of CompilerInstanceBuilder::diags that controlled whether to create one, but just doing it all the time seemed to improve the code. My suspicion is that the DiagnosticsEngine is not intended to be optional, it's just hard to sort out how to construct one (without a builder API).

The rest of it all looks like removal of boilerplate to me.

Thoughts on this direction?

## What should be in it? ##

My personal goal here is to pull in the prerequisites for removing CompilerInstance::setFileManager. It's possible this won't be possible, but if not I'll figure out why not in the process.

What else should be in there? Here are some options:

- FileManager (I hope so!)
- SourceManager (I think so!)
- Preprocessor?
- TargetInfo?
- Sema (probably not, it seems truly optional)
- ModuleManager (probably not?)

## How should it be staged? ##

Assuming this directly looks good, I'm not sure how to stage this. Should I create a monolithic patch and get it reviewed as a whole, or post a patch for a skeletal builder and then patches for adding arguments to it incrementally?

The latter is the usual approach, and is what I've been following locally. I have two patches:

- Add a builder that handles the current constructor arguments: PCHContainerOps and InMemoryModuleCache.
- Add a CompilerInvocation and DiagnosticsEngine to the builder and remove createDiagnostics, setDiagnostics, and setInvocation. The construction of these two is quite intertwined so there didn't seem to be a clean way to do this separately.

I'm hesitating slightly since I'll be touching roughly the same lines of code in every incremental patch, but I'm still inclined to go that way so that incremental progress doesn't bitrot.

Thanks for any feedback!
Duncan

wip-compiler-instance-builder.patch (48.4 KB)

ping

I have the beginnings of a patch for a CompilerInstance builder. I’d like some early feedback before going further.

Questions

  • Is a builder API a good idea?
  • Should it look something like this? If not, what?
  • What should be in it?
  • How should it be staged?

Motivation

My motivation for looking at this is a little tangential: while working on the InMemoryModuleCache recently (e.g., r355778) I realized that this cache should be leveraging the FileManager to dedup paths by inode. However when I tried a naïve fix I was getting crashes, and I had trouble understanding FileManager/VFS lifetime and ownership. Ultimately, I’m hoping to clarify what the invariants are around an instance of CompilerInstance and its FileManager.

Is a builder API a good idea?

A builder API seems a little cleaner than how we currently manage the boilerplate in constructing a CompilerInstance.

The status quo is that each client default-constructs and then calls various mutators like setInstance to set it up. However, this API allows clients to swap in different data structures after construction. If clients don’t need or use that flexibility, the code for maintaining that flexibility will be brittle, some combination of broken and confusing.

For example, I cleaned up some confusing logic in r357037 in CompilerInstance::setFileManager and CompilerInstance::createFileManager that were trying to adapt to a CompilerInstance’s VFS and FileManager changing at arbitrary times and keep them in sync (in that case, it was easy: just use FileManager’s VFS directly).

With a builder API, we can lock down CompilerInstance mutators so that it’s clear what is actually allowed to change mid-flight and what isn’t. It’s probably also easier to handle “obvious defaults” in a sane way for clients that only want to specify one or two things (e.g., unit tests).

Should it look something like this? If not, what?

Here’s the WIP patch I have:
https://reviews.llvm.org/differential/diff/192381/
(also attached below, but without the -U9999999 context)

The core idea is to remove CompilerInstance::set* and CompilerInstance::create* in cases where a CompilerInstance never needs more than one version of a data structure over its lifetime. So far the patch removes CompilerInstance::setInvocation, CompilerInstance::setDiagnostics, and CompilerInstance::createDiagnostics.

The meat is in two files:

Note on one semantic change I made: CompilerInstance always has a DiagnosticsEngine now. I did that because there were tons of sites calling CompilerInstance::createDiagnostics with no arguments. Originally I had a zero-parameter version of CompilerInstanceBuilder::diags that controlled whether to create one, but just doing it all the time seemed to improve the code. My suspicion is that the DiagnosticsEngine is not intended to be optional, it’s just hard to sort out how to construct one (without a builder API).

The rest of it all looks like removal of boilerplate to me.

Thoughts on this direction?

This seems like a nice idea to me. Building and configuring a CompilerInstance is certainly a mess right now, and this seems to make it cleaner and harder to get wrong.

What should be in it?

My personal goal here is to pull in the prerequisites for removing CompilerInstance::setFileManager. It’s possible this won’t be possible, but if not I’ll figure out why not in the process.

What else should be in there? Here are some options:

  • FileManager (I hope so!)
  • SourceManager (I think so!)
  • Preprocessor?
  • TargetInfo?
  • Sema (probably not, it seems truly optional)
  • ModuleManager (probably not?)

I think it’d make sense to see how far you can get with the builder pattern. Some of the above need to be created by the FrontendAction, possibly once for each input, and might not fit well into a builder-style approach – but maybe we’re missing a layer there, and there should be some separate object representing the specific action invocation (whose lifetime is the duration of an iteration of the BeginSourceFile/Execute/EndSourceFile loop in CompilerInstance::ExecuteAction) rather than lumping per-action state into the CompilerInvocation and resetting it between each action invocation? At least right now, the file manager and source manager are torn down and recreated for each action execution.

I think we’ll need to figure out:

  1. Do we allow multiple actions to be run on a single compiler instance, and if so, what does that mean? (Are they run in the same context or different contexts?)
  2. Do we allow multiple inputs to be passed to a single frontend action, and if so, what does that mean? (Are they run in the same context or different contexts?)

Right now, the answer to both 1 and 2 is yes, and they’re run in different contexts (we tear down and recreate most of the compiler instance each time). I think that’s probably a mistake, particularly for (2): the one frontend action that has a reason to support multiple input files is GenerateHeaderModuleAction, and it has to work around the default behavior of the CompilerInstance so that it can consume all the inputs in the same action invocation. I’d also like to eventually add a “final build” mode to clang, to parse multiple source files into the same AST and generate a single IR module from them (but with them properly isolated from each other by module visibility, to avoid the usual problems with that mode), and again that suggests that multiple input files to an action should be processed as part of the same AST / Sema / CodeGen context.

Maybe the answer to 1 should be no. In which case we might find that we end up wanting just

runAction(CompilerInstanceBuilder, FrontendAction)

and no exposed CompilerInstance object at all.

How should it be staged?

Assuming this directly looks good, I’m not sure how to stage this. Should I create a monolithic patch and get it reviewed as a whole, or post a patch for a skeletal builder and then patches for adding arguments to it incrementally?

The latter is the usual approach, and is what I’ve been following locally. I have two patches:

  • Add a builder that handles the current constructor arguments: PCHContainerOps and InMemoryModuleCache.
  • Add a CompilerInvocation and DiagnosticsEngine to the builder and remove createDiagnostics, setDiagnostics, and setInvocation. The construction of these two is quite intertwined so there didn’t seem to be a clean way to do this separately.

I’m hesitating slightly since I’ll be touching roughly the same lines of code in every incremental patch, but I’m still inclined to go that way so that incremental progress doesn’t bitrot.

The incremental path sounds preferable to me. I think you’re far enough along that we can imagine what a further-progressed result would look like without having a complete patch series. Touching the same lines repeatedly seems fine :slight_smile:

Thanks for the feedback! Things have probably bitrotted since March, but I’ll try to get back to this soon, rebase, and start sending patches.

I have the beginnings of a patch for a CompilerInstance builder. I’d like some early feedback before going further.

Questions

  • Is a builder API a good idea?
  • Should it look something like this? If not, what?
  • What should be in it?
  • How should it be staged?

Motivation

My motivation for looking at this is a little tangential: while working on the InMemoryModuleCache recently (e.g., r355778) I realized that this cache should be leveraging the FileManager to dedup paths by inode. However when I tried a naïve fix I was getting crashes, and I had trouble understanding FileManager/VFS lifetime and ownership. Ultimately, I’m hoping to clarify what the invariants are around an instance of CompilerInstance and its FileManager.

Is a builder API a good idea?

A builder API seems a little cleaner than how we currently manage the boilerplate in constructing a CompilerInstance.

The status quo is that each client default-constructs and then calls various mutators like setInstance to set it up. However, this API allows clients to swap in different data structures after construction. If clients don’t need or use that flexibility, the code for maintaining that flexibility will be brittle, some combination of broken and confusing.

For example, I cleaned up some confusing logic in r357037 in CompilerInstance::setFileManager and CompilerInstance::createFileManager that were trying to adapt to a CompilerInstance’s VFS and FileManager changing at arbitrary times and keep them in sync (in that case, it was easy: just use FileManager’s VFS directly).

With a builder API, we can lock down CompilerInstance mutators so that it’s clear what is actually allowed to change mid-flight and what isn’t. It’s probably also easier to handle “obvious defaults” in a sane way for clients that only want to specify one or two things (e.g., unit tests).

Should it look something like this? If not, what?

Here’s the WIP patch I have:
https://reviews.llvm.org/differential/diff/192381/
(also attached below, but without the -U9999999 context)

The core idea is to remove CompilerInstance::set* and CompilerInstance::create* in cases where a CompilerInstance never needs more than one version of a data structure over its lifetime. So far the patch removes CompilerInstance::setInvocation, CompilerInstance::setDiagnostics, and CompilerInstance::createDiagnostics.

The meat is in two files:

Note on one semantic change I made: CompilerInstance always has a DiagnosticsEngine now. I did that because there were tons of sites calling CompilerInstance::createDiagnostics with no arguments. Originally I had a zero-parameter version of CompilerInstanceBuilder::diags that controlled whether to create one, but just doing it all the time seemed to improve the code. My suspicion is that the DiagnosticsEngine is not intended to be optional, it’s just hard to sort out how to construct one (without a builder API).

The rest of it all looks like removal of boilerplate to me.

Thoughts on this direction?

This seems like a nice idea to me. Building and configuring a CompilerInstance is certainly a mess right now, and this seems to make it cleaner and harder to get wrong.

What should be in it?

My personal goal here is to pull in the prerequisites for removing CompilerInstance::setFileManager. It’s possible this won’t be possible, but if not I’ll figure out why not in the process.

What else should be in there? Here are some options:

  • FileManager (I hope so!)
  • SourceManager (I think so!)
  • Preprocessor?
  • TargetInfo?
  • Sema (probably not, it seems truly optional)
  • ModuleManager (probably not?)

I think it’d make sense to see how far you can get with the builder pattern. Some of the above need to be created by the FrontendAction, possibly once for each input, and might not fit well into a builder-style approach – but maybe we’re missing a layer there, and there should be some separate object representing the specific action invocation (whose lifetime is the duration of an iteration of the BeginSourceFile/Execute/EndSourceFile loop in CompilerInstance::ExecuteAction) rather than lumping per-action state into the CompilerInvocation and resetting it between each action invocation?

Another layer could help. I’ll think about this a bit.

At least right now, the file manager and source manager are torn down and recreated for each action execution.

The FileManager is also reused between module generation actions during implicit module builds.

I think we’ll need to figure out:

  1. Do we allow multiple actions to be run on a single compiler instance, and if so, what does that mean? (Are they run in the same context or different contexts?)
  2. Do we allow multiple inputs to be passed to a single frontend action, and if so, what does that mean? (Are they run in the same context or different contexts?)

Right now, the answer to both 1 and 2 is yes, and they’re run in different contexts (we tear down and recreate most of the compiler instance each time). I think that’s probably a mistake, particularly for (2): the one frontend action that has a reason to support multiple input files is GenerateHeaderModuleAction, and it has to work around the default behavior of the CompilerInstance so that it can consume all the inputs in the same action invocation. I’d also like to eventually add a “final build” mode to clang, to parse multiple source files into the same AST and generate a single IR module from them (but with them properly isolated from each other by module visibility, to avoid the usual problems with that mode), and again that suggests that multiple input files to an action should be processed as part of the same AST / Sema / CodeGen context.

Maybe the answer to 1 should be no. In which case we might find that we end up wanting just

runAction(CompilerInstanceBuilder, FrontendAction)

and no exposed CompilerInstance object at all.

Interesting, that would be pretty clean.

How should it be staged?

Assuming this directly looks good, I’m not sure how to stage this. Should I create a monolithic patch and get it reviewed as a whole, or post a patch for a skeletal builder and then patches for adding arguments to it incrementally?

The latter is the usual approach, and is what I’ve been following locally. I have two patches:

  • Add a builder that handles the current constructor arguments: PCHContainerOps and InMemoryModuleCache.
  • Add a CompilerInvocation and DiagnosticsEngine to the builder and remove createDiagnostics, setDiagnostics, and setInvocation. The construction of these two is quite intertwined so there didn’t seem to be a clean way to do this separately.

I’m hesitating slightly since I’ll be touching roughly the same lines of code in every incremental patch, but I’m still inclined to go that way so that incremental progress doesn’t bitrot.

The incremental path sounds preferable to me. I think you’re far enough along that we can imagine what a further-progressed result would look like without having a complete patch series. Touching the same lines repeatedly seems fine :slight_smile:

Thanks for the feedback! Things have probably bitrotted since March, but I’ll try to get back to this soon, rebase, and start sending patches.

It didn’t bitrot too much and I found some time this afternoon to rebase. Sent:
https://reviews.llvm.org/D62457
https://reviews.llvm.org/D62458