Thanks for starting this discussion, Justin!
I've spent some time in the last couple of days trying to figure out how
to adopt the [LLVM git monorepo prototype] for an out-of-tree backend.
TLDR: I'm not convinced that this prototype is the right approach to
converting to the monorepo, and I have a possible alternative.
The main problems I'm running into stem from the fact that this
prototype rewrites all of history from scratch rather than leveraging the
existing [official git mirrors]. This makes migrating out-of-tree work
from the official git mirrors to this repo very difficult, since there
is no shared history. Some efforts have gone into [documenting how to
port in-progress patches], but this doesn't attempt to discuss how to
handle more substantial out-of-tree work.
Issues with integrating the prototype
As far as I can tell, my options for trying to integrate with this
monorepo are fairly limited.
If I merge my trees directly into the monorepo prototype at head, I end
up with two copies of every commit, one of which is a monorepo style
commit and one with the singular repo history. These commits are
completely unrelated to each other, and exist in two separate parallel
histories, making it difficult to correlate one to the other or even to
tell which is which.
An arguably cleaner solution would be to try to recreate all of my trees'
history artificially as if they were based on the monorepo prototype
history all along, but this has two problems. First, it's a very
significant tooling effort to do this - I'd need to match up several
years of merge points to their corresponding spots in the monorepo
prototype and somehow redo all of the merges in the same ways. Tools
like "rebase --preserve-merges" don't really help here, since they abort
on merge conflicts and ask a human to resolve them again. Even if I were
to come up with tooling that managed this, I'm still left with a
completely new set of hashes for commits and no easy way to map them to
existing references in emails, bug trackers, and release notes.
Finally, there's the option of throwing away all of my history and
applying my out-of-tree work in a single patch. This makes git-log and
git-blame useless for investigating issues in my codebase for a few
years. It also means that when fixes go into older branches they can't
be merged forward and need to be redone by hand.
All of these have very significant drawbacks, and none of them really
sounds like a good option at all.
We're in this situation. We have over 7 years of git history for our
out-of-tree target and it would be a huge pain and drawback if we were
to lose that history by e.g. needing to apply all our changes as a
single patch to the new monorepo.
We haven't started moving to the monorepo yet so while we haven't hit
the issues in practice yet, we will. Preserving the history from the git
mirrors would surely be beneficial.
An alternative approach
All of these problems could be mitigated if we could preserve the
history of the existing git mirrors when generating the monorepo. There
are two ways to do this.
1. Start the monorepo by subtree-merging the various repos together at
an arbitrary point in time.
2. "Zip" together the commits in each official git mirror repo by
merging them into a combined view after each commit.
While I personally don't see a problem with (1), I've heard people claim
that they want to use the monorepo to bisect arbitrarily far back into
history. If this is the case, we'd prefer an approach like (2).
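To make (2) a bit more concrete, here is a rough Python sketch of the zipping idea. It is a toy model over in-memory data, not real conversion tooling. The `Commit` and `ZipCommit` types, the `zip-N` ids, and the choice to order commits by timestamp are all assumptions made for illustration; a real tool would more likely order by SVN revision and would create actual git merge commits.

```python
from dataclasses import dataclass
from heapq import merge
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    repo: str       # illustrative repo name, e.g. "llvm" or "clang"
    sha: str        # id in the existing git mirror; never rewritten
    timestamp: int  # assumed ordering key (a real tool might use the SVN revision)


@dataclass(frozen=True)
class ZipCommit:
    zid: str                            # placeholder id for the combined-view commit
    parents: Tuple[str, ...]            # (previous zip commit, mirror commit)
    heads: Tuple[Tuple[str, str], ...]  # (repo, sha) pairs visible in this view


def zip_histories(histories: Dict[str, List[Commit]]) -> List[ZipCommit]:
    """Interleave per-repo histories, emitting one combined commit per mirror commit.

    Each input list must already be sorted by timestamp.
    """
    heads: Dict[str, str] = {}
    zipped: List[ZipCommit] = []
    prev: Optional[str] = None
    for c in merge(*histories.values(), key=lambda c: c.timestamp):
        heads[c.repo] = c.sha  # the combined view now points at this mirror commit
        parents = (c.sha,) if prev is None else (prev, c.sha)
        zid = f"zip-{len(zipped) + 1}"
        zipped.append(ZipCommit(zid, parents, tuple(sorted(heads.items()))))
        prev = zid
    return zipped
```

The point of the sketch is that the mirror commits (`sha`) are only ever referenced as parents, never rewritten, so existing hashes stay valid while every zip commit records the state of all repos at that point.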
A zippered repository gives us a lot of the benefits of the prototype,
without a lot of the issues that are caused by rewriting history:
- The commits from the official git mirrors exist as they are now, and
we don't need to deal with changing hashes.
- Out-of-tree branches have all of their history whether they opt in to
creating a monorepo style history or not.
- All of the repo's history is visible as a monorepo by looking only at
the merge commits. Bisect scripts can easily filter to these.
- The monorepo commits and individual repo commits are easily
discernible and have a direct link between them in git's DAG, making
it easy to find one from the other.
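The last two points can be illustrated with a small Python sketch over a toy DAG. This assumes a particular layout that the thread does not spell out: each zip commit is a merge whose first parent is the previous zip commit and whose remaining parents are the mirror commits it pulls in. The commit ids and helper names are hypothetical. With real git, something like `git log --first-parent --merges` would be the rough equivalent of the first helper.

```python
from typing import Dict, List, Optional

# Toy DAG: commit id -> parent ids, first parent first.
Dag = Dict[str, List[str]]


def monorepo_commits(dag: Dag, head: str) -> List[str]:
    """Follow first parents from head and keep only merge commits (newest first)."""
    result: List[str] = []
    cur: Optional[str] = head
    while cur is not None:
        parents = dag.get(cur, [])
        if len(parents) > 1:  # a merge commit, i.e. a monorepo-style commit
            result.append(cur)
        cur = parents[0] if parents else None
    return result


def mirror_commits_of(dag: Dag, zip_commit: str) -> List[str]:
    """Mirror commits merged in by a zip commit (every parent after the first)."""
    return dag.get(zip_commit, [])[1:]


def zip_commit_for(dag: Dag, head: str, mirror_commit: str) -> Optional[str]:
    """Find the zip commit that merged a given mirror commit, if any.

    The root zip commit joins two mirror roots, so all of its parents count.
    """
    for z in monorepo_commits(dag, head):
        is_root = z == monorepo_commits(dag, head)[-1]
        candidates = dag[z] if is_root else mirror_commits_of(dag, z)
        if mirror_commit in candidates:
            return z
    return None


# Hypothetical history: a* are commits from one mirror, b* from another.
dag: Dag = {
    "a1": [], "b1": [],
    "a2": ["a1"], "b2": ["b1"],
    "m1": ["a1", "b1"],  # root zip commit joining both mirrors
    "m2": ["m1", "a2"],
    "m3": ["m2", "b2"],
}
```

Here `monorepo_commits(dag, "m3")` yields only the `m*` commits, which is the list a bisect script would walk, and `zip_commit_for` goes the other way, from a mirror commit back to the monorepo-style commit that introduced it.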
To demonstrate this approach, I've put up a snapshot of what LLVM might
look like if we did this, using some scripts that Duncan wrote a while
back to experiment with the idea:
I took a quick look at the zipper prototype and I think it looks awesome!
(Then unfortunately gitk flipped out and after 40 minutes it ate 42GB of
memory (and continued grabbing more) but I don't know if that's a
problem that is perhaps solved in a more recent git version than I'm
running or what the problem really is.)