RFC: Dealing with out of tree changes and the LLVM git monorepo

Hi all,

I've spent some time in the last couple of days trying to figure out how
to adopt the [LLVM git monorepo prototype] for an out of tree backend.
TLDR: I'm not convinced that this prototype is the right approach to
converting to the monorepo, and I have a possible alternative.

The main problems I'm running into stem from the fact that this
prototype rewrites all of history from scratch rather than leverage the
existing [official git mirrors]. This makes migrating out-of-tree work
from the official git mirrors to this repo very difficult, since there
is no shared history. Some efforts have gone into [documenting how to
port in-progress patches], but this doesn't attempt to discuss how to
handle more substantial out of tree work.

Issues with integrating the prototype

> ...
> difficult to correlate one to the other or even to tell which is which.
>...
> new set of hashes for commits and no easy way to map them to
> existing references

I would like to point out that while the majority of the issues described here are very real,
mapping two commits to each other seems rather straightforward by means
of the SVN revision, which is still available in every kind of git-svn conversion I've seen.

It might present considerable inconvenience when investigating history manually,
but all kinds of automation are possible (e.g. remapping commits in a bug-tracking system,
filtering commit messages to refer to the proper commit SHAs, etc.).

The only configuration when SVN revision is not enough is when you need to map commits
coming from different "subrepos", where there is no one-to-one correspondence between
commits and SVN numbers.
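As a rough illustration of the mapping described above, here is a minimal sketch. It assumes that both histories record the SVN revision in git-svn's `git-svn-id: <url>@<rev> <uuid>` trailer; a conversion that records the revision differently would need a different pattern. The function names and the `(hash, message)` input shape are hypothetical, and the caveat about separate "subrepos" still applies: the map is only meaningful between two histories of the same project.

```python
import re

# git-svn appends a trailer of the form
#   git-svn-id: <repository URL>@<revision> <repository UUID>
# Assumption: both histories being compared carry this trailer.
SVN_ID = re.compile(r"^\s*git-svn-id:\s+\S+@(\d+)\s+\S+\s*$", re.MULTILINE)


def svn_revision(message, pattern=SVN_ID):
    """Return the SVN revision recorded in a commit message, or None."""
    matches = pattern.findall(message)
    # If a message quotes an older trailer, the last one is the commit's own.
    return int(matches[-1]) if matches else None


def map_hashes(old_commits, new_commits, pattern=SVN_ID):
    """Map old commit hashes to new ones that share an SVN revision.

    Both arguments are iterables of (hash, message) pairs, for example
    collected from `git log` output (hypothetical input shape).
    """
    new_by_rev = {}
    for sha, message in new_commits:
        rev = svn_revision(message, pattern)
        if rev is not None:
            new_by_rev[rev] = sha

    mapping = {}
    for sha, message in old_commits:
        rev = svn_revision(message, pattern)
        if rev in new_by_rev:
            mapping[sha] = new_by_rev[rev]
    return mapping
```

The resulting dictionary could then drive the kind of automation mentioned above, such as rewriting hashes in bug-tracker entries.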

regards,
Fedor.

I’m going to try to stay out of the question of whether or not we should do it this way. (We’ll see if I succeed. :)

But if we do decide to do it this way, it would be nice if we’d do an N-way merge when there’s a single SVN commit that affects multiple git repos.
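To make the N-way merge idea concrete, here is a minimal sketch, not taken from any script mentioned in this thread. It groups per-repository commits by SVN revision, so a revision that touched several repositories yields a single entry with several parents. The function name and the input shape are hypothetical.

```python
def zip_by_revision(repos):
    """Group per-repo commits by SVN revision, in revision order.

    `repos` maps a repository name to a list of (svn_revision, commit_hash)
    pairs. Each returned entry is (svn_revision, {repo: commit_hash}) and
    stands for one merge commit in a zippered history; an entry naming more
    than one repo corresponds to an N-way merge.
    """
    by_rev = {}
    for repo, commits in repos.items():
        for rev, sha in commits:
            by_rev.setdefault(rev, {})[repo] = sha
    return [(rev, by_rev[rev]) for rev in sorted(by_rev)]
```

For example, if revision 3 touched both llvm and clang, the entry for revision 3 lists both commits, while revisions 1 and 2 each list one.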

Justin Lebar <jlebar@google.com> writes:

> I'm going to try to stay out of the question of whether or not we should do
> it this way. (We'll see if I succeed. :)
>
> But if we do decide to do it this way, it would be nice if we'd do an N-way
> merge when there's a single SVN commit that affects multiple git repos.

The prototype I linked to does this. See for example:

  Merging r114490: "Fixed pr20314-2.c failure, added E, F, p constraint… · bogner/llvm-zipper-prototype@6258012 · GitHub

> Hi all,
>
> I've spent some time in the last couple of days trying to figure out how
> to adopt the [LLVM git monorepo prototype] for an out of tree backend.
> TLDR: I'm not convinced that this prototype is the right approach to
> converting to the monorepo, and I have a possible alternative.

I think it's too late at this point to start considering alternative
monorepo layouts. We're already behind in getting the current monorepo
up and running, and I think discussing and implementing an alternative
will take too long and put our goal of moving off SVN by next year's
development meeting at risk.

Is it possible that the monorepo you have proposed could be used as an
aid to people trying to integrate out-of-tree branches into the current monorepo?
For example, would someone be able to merge their changes into your monorepo
and then cherry-pick them to the current monorepo?

-Tom

Tom Stellard <tstellar@redhat.com> writes:

> > Hi all,
> >
> > I've spent some time in the last couple of days trying to figure out how
> > to adopt the [LLVM git monorepo prototype] for an out of tree backend.
> > TLDR: I'm not convinced that this prototype is the right approach to
> > converting to the monorepo, and I have a possible alternative.
>
> I think it's too late at this point to start considering alternative
> monorepo layouts. We're already behind in getting the current monorepo
> up and running, and I think discussing and implementing an alternative
> will take too long and put our goal of moving off SVN by next year's
> development meeting at risk.

The layout here is not at all different, only the process by which the
repo is generated. I strongly believe that a history preserving
conversion is very important if we want to avoid making porting
out-of-tree work horribly disruptive.

> Is it possible that the monorepo you have proposed could be used as an
> aide to people trying to integrate out-of-tree branches into the
> current monorepo?
> For example, would someone be able to merge their changes into your monorepo
> and then cherry-pick them to the current monorepo?

Cherry picking out of tree branches is not at all practical. If I have a
backend that's been in development for several years and has many
merges, cherry picking doesn't help. We'd probably need a tool that
regenerates the history "as-if" it had been done on the monorepo itself,
but besides being fairly difficult to do, that has its own problems that
I described below.

Tom Stellard <tstellar@redhat.com> writes:

> > > Hi all,
> > >
> > > I've spent some time in the last couple of days trying to figure out how
> > > to adopt the [LLVM git monorepo prototype] for an out of tree backend.
> > > TLDR: I'm not convinced that this prototype is the right approach to
> > > converting to the monorepo, and I have a possible alternative.
> >
> > I think it's too late at this point to start considering alternative
> > monorepo layouts. We're already behind in getting the current monorepo
> > up and running, and I think discussing and implementing an alternative
> > will take too long and put our goal of moving off SVN by next year's
> > development meeting at risk.
>
> The layout here is not at all different, only the process by which the
> repo is generated. I strongly believe that a history preserving
> conversion is very important if we want to avoid making porting
> out-of-tree work horribly disruptive.

The process is actually what I'm concerned about here, much more so than
the physical layout of the repo. It takes time to discuss, develop
and debug a new process for automatically syncing from SVN to a new git
repository. We've already gone through all these steps with the existing
monorepo, so switching to something else at this point would be a step
backwards in my opinion.

-Tom

At this point we can still consider it. I highly doubt that waiting a few weeks would jeopardize the one-year deadline (which is really not that ambitious).
What we should do, in my opinion, is go with strict deadlines: i.e. every stage of discussion should be open for a very limited time.
The current linear repo has the edge, but we could, for example, leave this “zipper” proposal open for the next week (or two if you want) as an RFC. Unless this alternative gets significant traction, we should close the discussion and move ahead with the linear history repo.

After almost two years of more or less stagnation, I feel it’d be unfortunate to rush right now on what I perceive as an important design point (especially for downstream users) like this one.

Cheers,

Tom Stellard <tstellar@redhat.com> writes:

> > > > Hi all,
> > > >
> > > > I've spent some time in the last couple of days trying to figure out how
> > > > to adopt the [LLVM git monorepo prototype] for an out of tree backend.
> > > > TLDR: I'm not convinced that this prototype is the right approach to
> > > > converting to the monorepo, and I have a possible alternative.
> > >
> > > I think it's too late at this point to start considering alternative
> > > monorepo layouts. We're already behind in getting the current monorepo
> > > up and running, and I think discussing and implementing an alternative
> > > will take too long and put our goal of moving off SVN by next year's
> > > development meeting at risk.
> >
> > The layout here is not at all different, only the process by which the
> > repo is generated. I strongly believe that a history preserving
> > conversion is very important if we want to avoid making porting
> > out-of-tree work horribly disruptive.
>
> The process is actually what I'm concerned about here, much more so than
> the physical layout of the repo. It takes time to discuss, develop
> and debug a new process for automatically syncing from SVN to a new git
> repository. We've already gone through all these steps with the existing
> monorepo, so switching to something else at this point would be a step
> backwards in my opinion.

I appreciate the amount of effort you and others have put in to get us
this far, but in my opinion these steps are not quite complete. A lot of
people have just started actually trying to merge with the monorepo
prototype since it was announced that it's intended to be the official
one last week. While there's certainly been a lot of discussion about
the monorepo in general in the last couple of years, I really hadn't
seen much serious discussion in public about the actual conversion until
the "New LLVM git repository conversion prototype" thread earlier this
month.

Just to elaborate a bit on why I think this is important, I think the
difference between the two approaches to conversion has to do with what
we consider the real source of truth in our repository history. The
current prototype rebuilds everything with SVN as a source of truth and
throws out the official git mirrors, which sounds nice in theory, but
has pragmatic problems. The reality is that a lot of people have been
basing work off the git mirrors for a number of years now, so throwing
away that history causes real world problems.

Mehdi AMINI <joker.eph@gmail.com> writes:

> At this point we can still consider it, I highly doubt that waiting a few
> weeks would jeopardize the one year deadline (that is really not that
> ambitious).
>
> What we should do though in my opinion is go with strict deadlines: i.e.
> every stage of discussion should be open for a very limited time.
> The current linear repo has the edge, but we for example we could leave
> this "zipper" proposal open for the next 1 week (or 2 if you want) as an
> RFC. Unless this alternative gets a high traction then we should close and
> move with the linear history repo.
>
> After almost two years of more or less stagnation, I feel it'd be
> unfortunate to rush right now on what I perceive as important design point
> (especially for downstream users) like this one.

I agree with this. I'd appreciate if we give this a bit of time for
other people to weigh in - I suspect others are hitting the same issues
as I in trying to integrate with this version of the monorepo.

Justin,

Could you show me an example of a longer tree to be migrated?
It’s okay if it is not yours, as long as it is public on GitHub.

I suggest we provide a script to migrate deep trees.

  1. Generate svnrev-hash maps for both the monorepo and each individual.git.
    (It may be delayed until (3))
  2. Run git-fast-export on the branch.
  3. Run git-fast-import, substituting out-of-branch hashes.

I am not certain git-fast-export would be mature.
In contrast, I am certain git-fast-import is mature.
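A minimal sketch of step 3, under several assumptions: the stream comes from `git fast-export` with parents outside the exported range referenced by full hash (for example via `--reference-excluded-parents` in Git versions that have that option), all `data` commands use the counted `data <n>` form, and `hash_map` is the svnrev-derived map from step 1. The function name is hypothetical and this has not been tried on a real out-of-tree branch.

```python
import re

# A parent reference by full 40-hex hash, as emitted for commits that lie
# outside the exported range.
PARENT = re.compile(rb"^(from|merge) ([0-9a-f]{40})\n$")


def rewrite_parents(src, dst, hash_map):
    """Copy a fast-export stream from src to dst, remapping parent hashes.

    src and dst are binary file objects; hash_map maps old 40-hex hashes
    (bytes) to their replacements (bytes). Payloads of `data <n>` commands
    (commit messages, blobs) are copied verbatim, so text that merely looks
    like a `from` line is left untouched.
    """
    while True:
        line = src.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # Assumption: counted form only, i.e. "data <n>\n" + n raw bytes.
            dst.write(line)
            dst.write(src.read(int(line[5:])))
            continue
        match = PARENT.match(line)
        if match and match.group(2) in hash_map:
            line = match.group(1) + b" " + hash_map[match.group(2)] + b"\n"
        dst.write(line)
```

In a pipeline this would sit between the two commands, reading `sys.stdin.buffer` and writing `sys.stdout.buffer`. Note that every rewritten commit still gets a new hash, so references recorded elsewhere are not preserved.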

ps. I tried the zipper layout several years ago and I concluded it was not useful.
That is the reason why, in my monorepo, I grafted some commits onto the corresponding commits of individual.git.
It just guaranteed my monorepo isn’t an orphan.
Note, I don’t think such grafts were really useful.

Takumi

When you say, SVN is the source of truth, I agree, it is, the official git mirrors were never the source of truth. Any time you would’ve had to upstream a patch with the current SVN system, you end up rebasing it and losing the merge history. I don’t see how moving to the monorepo is different. It’ll only be painful for a few years, as you’ve said, which is kind of to be expected.

That said, I haven’t dug into the zipper proposal, maybe it’s not a big imposition. I just never felt that the “official” git mirrors were particularly official, they were always just something for developers to use to get work done, available without too many implied promises of stability.

NAKAMURA Takumi <geek4civic@gmail.com> writes:

> Justin,
>
> Could you show me an example of longer tree to to be migrated?
> It's okay if one is not yours but public in the github.

Unfortunately my work is a proprietary backend, so I can't share it. It
would take quite a bit of effort to make something artificial that was
realistic.

You could perhaps look at something like Swift if you wanted to
experiment, but I don't know how complex their branching structure is.

> I suggest we may provide the script to migrate deep tree.
>
> 1) Generate svnrev-hash maps for each the monorepo and other individual.git.
>   (It may be delayed until (3))
> 2) Do git-fast-export the branch.
> 3) Do git-fast-import with substituting out-of-branch hashes.
>
> I am not certain git-fast-export would be mature.
> In contrast, I am certain git-fast-import is mature.

I have doubts about how effective this would be, and even if it works it
means every hash that's recorded in my bug tracker, in my commit
messages, and in release notes becomes invalid.

This seems much worse than the zipper layout to me.

> ps. I tried the zipper layout several years ago and I concluded it was not
> useful.
> It's the reason why, in my monorepo, I grafted some commits to each
> corresponding commits of individual.git.
> It just guaranteed my monorepo isn't orphan.
> Note, I don't think such grafts were really useful.

I'm not sure I understand what problems you found. Have you looked at
the repo with zipper layout I've prototyped at
https://github.com/bogner/llvm-zipper-prototype ?

I’d suggest trying to script something like the following.

For each svn commit;
$ git replace

$ git filter-branch --index-filter script to drop projects archived from the mono repo HEAD…branch1 HEAD…branch2 …

Hi,

Thanks for starting this discussion Justin!

> Hi all,
>
> I've spent some time in the last couple of days trying to figure out how
> to adopt the [LLVM git monorepo prototype] for an out of tree backend.
> TLDR: I'm not convinced that this prototype is the right approach to
> converting to the monorepo, and I have a possible alternative.
>
> The main problems I'm running into stem from the fact that this
> prototype rewrites all of history from scratch rather than leverage the
> existing [official git mirrors]. This makes migrating out-of-tree work
> from the official git mirrors to this repo very difficult, since there
> is no shared history. Some efforts have gone into [documenting how to
> port in-progress patches], but this doesn't attempt to discuss how to
> handle more substantial out of tree work.
>
> Issues with integrating the prototype
> -------------------------------------
>
> As far as I can tell, my options for trying to integrate with this
> monorepo are fairly limited.
>
> If I merge my trees directly into the monorepo prototype at head, I end
> up with two copies of every commit, one of which is a monorepo style
> commit and one with the singular repo history. These commits are
> completely unrelated to each other, and exist in two separate parallel
> histories, making it difficult to correlate one to the other or even to
> tell which is which.
>
> An arguably cleaner solution would be try to recreate all of my trees'
> history artificially as if they were based on the monorepo prototype
> history all along, but this has two problems. First, it's a very
> significant tooling effort to do this - I'd need to match up several
> years of merge points to their corresponding spots in the monorepo
> prototype and somehow redo all of the merges in the same ways. Tools
> like "rebase --preserve-merges" don't really help here, since they abort
> on merge conflicts and ask a human to resolve them again. Even if I were
> to come up with tooling that managed this, I'm still left with a
> completely new set of hashes for commits and no easy way to map them to
> existing references in emails, bug trackers, and release notes.
>
> Finally, there's the option of throwing away all of my history and
> applying my out of tree work in a single patch. This makes git-log and
> git-blame useless for investigating issues in my codebase for a few
> years. It also means that when fixes go into older branches they can't
> be merged forward and need to be redone by hand.
>
> All of these have very significant drawbacks, and none of them really
> sounds like a good option at all.

We're in this situation. We have over 7 years of git history for our
out-of-tree target and it would be a huge pain and drawback if we were
to lose that history by e.g. needing to apply all our changes as a
single patch to the new monorepo.

We haven't started moving to the monorepo yet so while we haven't hit
the issues in practice yet, we will. Preserving the history from the git
mirrors would surely be beneficial.

> An alternative approach
> -----------------------
>
> All of these problems could be mitigated if we could preserve the
> history of the existing git mirrors when generating the monorepo. There
> are two ways to do this.
>
> 1. Start the monorepo by subtree-merging the various repos together at
>     an arbitrary point in time.
>
> 2. "Zip" together the commits in each official git mirror repo by
>     merging them into a combined view after each commit.
>
> While I personally don't see a problem with (1), I've heard people claim
> that they want to use the monorepo to bisect arbitrarily far back into
> history. If this is the case, we'd prefer an approach like (2).
>
> A zippered repository gives us a lot of the benefits of the prototype,
> without a lot of the issues that are caused by rewriting history:
>
> - The commits from the official git mirrors exist as they are now, and
>    we don't need to deal with changing hashes.
>
> - Out-of-tree branches have all of their history whether they opt in to
>    creating a monorepo style history or not
>
> - All of the repo's history is visible as a monorepo by looking only at
>    the merge commits. Bisect scripts can easily filter to these.
>
> - The monorepo commits and individual repo commits are easily
>    discernible and have a direct link between them in git's DAG, making
>    it easy to find one from the other.
>
> To demonstrate this approach, I've put up a snapshot of what LLVM might
> look like if we did this, using some scripts that Duncan wrote a while
> back to experiment with the idea:
>
>    https://github.com/bogner/llvm-zipper-prototype
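The claim that bisect scripts can filter to the merge commits could be sketched roughly as follows. This is an illustration only, assuming that in a zippered history every commit with two or more parents is a monorepo-style commit (an out-of-tree branch with its own merges would need a stricter check). The helper names are hypothetical; the one fact relied on is that `git bisect run` treats exit status 125 as "skip this commit".

```python
SKIP = 125  # exit status that tells `git bisect run` to skip the commit


def is_merge(rev_list_line):
    """True if a `git rev-list --parents` line names 2 or more parents.

    Such a line has the form "<commit> <parent1> <parent2> ...".
    """
    return len(rev_list_line.split()) >= 3


def bisect_status(rev_list_line, run_test):
    """Return the exit status a bisect script should use for one commit.

    Non-merge commits (single-project commits in a zippered history) are
    skipped; merge commits are tested with the caller-supplied run_test(),
    which returns True when the commit is good.
    """
    if not is_merge(rev_list_line):
        return SKIP
    return 0 if run_test() else 1
```

A script passed to `git bisect run` would obtain the line from `git rev-list --parents -n 1 HEAD` and exit with the returned status.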

I took a quick look at the zipper prototype and I think it looks awesome!

(Then unfortunately gitk flipped out and after 40 minutes it ate 42GB of
memory (and continued grabbing more) but I don't know if that's a
problem that is perhaps solved in a more recent git version than I'm
running or what the problem really is.)

Thanks,
Mikael

> Hi,
>
> Thanks for starting this discussion Justin!

> Hi all,
>
> I've spent some time in the last couple of days trying to figure out how
> to adopt the [LLVM git monorepo prototype] for an out of tree backend.
> TLDR: I'm not convinced that this prototype is the right approach to
> converting to the monorepo, and I have a possible alternative.
>
> The main problems I'm running into stem from the fact that this
> prototype rewrites all of history from scratch rather than leverage the
> existing [official git mirrors]. This makes migrating out-of-tree work
> from the official git mirrors to this repo very difficult, since there
> is no shared history. Some efforts have gone into [documenting how to
> port in-progress patches], but this doesn't attempt to discuss how to
> handle more substantial out of tree work.
>
> Issues with integrating the prototype
> -------------------------------------
>
> As far as I can tell, my options for trying to integrate with this
> monorepo are fairly limited.
>
> If I merge my trees directly into the monorepo prototype at head, I end
> up with two copies of every commit, one of which is a monorepo style
> commit and one with the singular repo history. These commits are
> completely unrelated to each other, and exist in two separate parallel
> histories, making it difficult to correlate one to the other or even to
> tell which is which.
>
> An arguably cleaner solution would be try to recreate all of my trees'
> history artificially as if they were based on the monorepo prototype
> history all along, but this has two problems. First, it's a very
> significant tooling effort to do this - I'd need to match up several
> years of merge points to their corresponding spots in the monorepo
> prototype and somehow redo all of the merges in the same ways. Tools
> like "rebase --preserve-merges" don't really help here, since they abort
> on merge conflicts and ask a human to resolve them again. Even if I were
> to come up with tooling that managed this, I'm still left with a
> completely new set of hashes for commits and no easy way to map them to
> existing references in emails, bug trackers, and release notes.
>
> Finally, there's the option of throwing away all of my history and
> applying my out of tree work in a single patch. This makes git-log and
> git-blame useless for investigating issues in my codebase for a few
> years. It also means that when fixes go into older branches they can't
> be merged forward and need to be redone by hand.
>
> All of these have very significant drawbacks, and none of them really
> sounds like a good option at all.
>

> We're in this situation. We have over 7 years of git history for our
> out-of-tree target and it would be a huge pain and drawback if we were
> to lose that history by e.g. needing to apply all our changes as a
> single patch to the new monorepo.
>
> We haven't started moving to the monorepo yet so while we haven't hit
> the issues in practice yet, we will. Preserving the history from the git
> mirrors would surely be beneficial.

We are also in the same situation for our out-of-tree CHERI backend
(https://github.com/CTSRD-CHERI/llvm,
https://github.com/CTSRD-CHERI/clang,
https://github.com/CTSRD-CHERI/lld). I am aware there were some
attempts at converting our repos to a monorepo structure a few years
ago according to
<http://lists.llvm.org/pipermail/llvm-dev/2016-July/102787.html>.
However, I'm not sure if the script mentioned there can be reused with
the new git monorepo and it seems that it only handles clang. We would
have to also include our forks of llvm, lld, libunwind and libc++.

Thanks,
Alex

I just want to point out that the issue of incompatible history is not new. It was already being discussed back in July 2016.

http://lists.llvm.org/pipermail/llvm-dev/2016-July/102657.html

As James said in that email:

> That we’ll be getting incompatible history has been glossed over, and it is
> indeed really important to make it clear and have a good plan there. This
> doesn’t only affect actual “forks”, it also affects every single developer
> with a local git clone which contains unfinished work.

So, what is the plan with the existing mono-repo implementation? If there isn’t one, then we should strongly consider alternative implementations of the mono-repo.

I also strongly believe we should not allow a schedule to force us to ignore significant problems in the proposals and implementations. Especially ones that we’ve known about for years.

-Chris

While my team doesn’t have one, it’s clear that out-of-tree backends are an important long-standing valuable use-case for downstream consumers of LLVM, and the new monorepo should try very hard NOT to make their lives difficult.

–paulr

Agreed. I would also argue that this problem isn't unique to out-of-tree backends. In general it could impact any fork that has out-of-tree changes. Out-of-tree backends are probably the most common use case of that kind, but it will also likely impact a variety of other forks of LLVM projects. For example, this will likely affect the Swift project's forks of LLVM & Clang, which have out-of-tree modifications.

-Chris

Justin Bogner via llvm-dev <llvm-dev@lists.llvm.org> writes:

> The layout here is not at all different, only the process by which the
> repo is generated. I strongly believe that a history preserving
> conversion is very important if we want to avoid making porting
> out-of-tree work horribly disruptive.

How would an out-of-tree branch be ported with this new approach? Do
you have scripts to do it?

                          -David

Justin,

I thought we might provide yet another zipper repo that contains both individual.git and the monorepo.
I guess it would make it easier to interact between your repo and the monorepo.

As others mentioned, a zipper repo is useful for migration, but I think it would be hard to live with in my daily development.

I will experiment.

Takumi