Git Transition status?

Hi all-
I was wondering if anyone knows the status/schedule of the SVN to Git/GitHub transition. I thought I saw that it was agreed upon at the November meeting, but I'm not sure I've seen any progress since.

Thanks,
Erich

Hi,

The main outcome of the BoF at the dev meeting was that we agreed that moving to GitHub was the best way forward for LLVM (IIRC only one person in the room expressed concern about GitHub, but he said he had a personal grievance with them and nothing specific to LLVM).

The unknown that remains is: will we use a mono-repo or a multi-repo. On this aspect:

  • We got consensus at the BoF that downstream users (i.e. non-contributors) are not impacted by this choice, and we’re not gonna optimize the repository structure for them.
  • My reading of the survey is that the monorepo has a significant lead.
  • My understanding of the dynamic of the discussions and questions during the BoF is that monorepo has a significant lead, is likely to satisfy more people, and has a very small number of people concerned about it. On the other hand many people have strong feelings about the multirepo.

Considering all the current tradeoffs, it is likely that we will move forward with a monorepo, although there is no guarantee and no decision has been made at this point.

The path forward (already under way) is a prototype phase: we’re building a monorepo and trying to make it as usable as possible, without making any change or building anything that would commit us to a monorepo (for instance we’re not gonna migrate any bots to it).

The goal of this prototype is that developers can start using a monorepo to try it, and we can evaluate how it plays in practice, outside of theoretical considerations. If anyone finds concerns about a given workflow, we can study what can be improved to address it, or maybe we’ll hit a wall that would show that monorepo can’t address what we think it will.

At some point, if the experiment is conclusive, we should be able to build a larger majority and hopefully reach a consensus that the proposed prototype is viable for development, and then start planning the changes that would actually commit us to it.

The monorepo is not totally ready yet, but you can already try it (I use it day-to-day for my development, as do a few other people). Instructions are in the doc: http://llvm.org/docs/GettingStarted.html#for-developers-to-work-with-a-git-monorepo

I don’t have a schedule to announce; hopefully we can make it all happen in 2017.

Best,

I see, thank you very much for the update! I’m glad it is moving forward.

-Erich

I’ve been working on some scripting for re-converting the svn repository into git.

The existing git conversions are entirely sufficient for day-to-day development purposes, but not up to the standard (at least, my standard) for replacing svn as the authoritative source repository. For one example, clang.git doesn’t actually go back to the first commit of clang, because weird stuff happened early in the svn history that threw off git-svn.

Here’s my work in progress, but there’s still more work to be done:
https://github.com/jyknight/llvm-git-migration (conversion scripts)
https://github.com/jyknight/llvm-monorepo (test repository)

So the Github-Importer screws up the old dark places of the SVN-History and the scripts try to handle this?

Something like that.
FWIW: This is not surprising.

When I moved GCC from CVS to SVN, the older versions of CVS had bugs that the SVN importer couldn’t handle, and I essentially had to rewrite Subversion’s cvs2svn to make it work (plus, it was originally so slow that the import would have taken 2 real-time months).

On the plus side, once we fixed all the bugs, we were also able to get RCS history into the SVN history :)

I had a similar experience when I did the same migration for PHP’s repos in 2009. It took several months of prep and a multi-thousand-line script around cvs2svn, and they still lost some history in the process.

I don’t know of any plus sides, aside from they weren’t on CVS anymore ^^;

It’s safe to say there will always be some rough edges when dealing with so much history.

– Gwynne Raskind

- My reading of the survey is that the monorepo has a significant lead.

That was not my understanding, but we shouldn't be arguing over small
differences in percentages. We had enough of that on both sides of the
Atlantic already in 2016.

- My understanding of the dynamic of the discussions and questions during the
BoF is that monorepo has a significant lead, is likely to satisfy more
people, and has a very small number of people concerned about it. On the
other hand many people have strong feelings about the multirepo.

My understanding is that people were happy with the mono-repo as long
as it had the right balance between what's in the mono-repo and what's
left out.

That breakdown hasn't reached a consensus yet.

Considering all the current tradeoffs, it is likely that we will move forward
with a monorepo, although there is no guarantee and no decision has been made
at this point.

I wouldn't start betting on the likelihood of anything at this stage.
Right now we need to understand what's the most logical and the least
impacting split for a mono-repo.

From what I gathered at the BoF, most people were happy with the core
(revision sync) repos in the mono and everything else out.

What that means is that they will be physically separated, which is a
very different scenario than what we have today and those issues will
have to be sorted out before any decision.

Also, not everyone agreed on the definition of "core repo".

The path forward (already under way) is a prototype phase: we’re
building a monorepo and trying to make it as usable as possible,
without making any change or building anything that would commit us to a
monorepo (for instance we’re not gonna migrate any bots to it).

Is this anywhere we can use? Did I miss the announcement of such a project?

I don’t have a schedule to announce; hopefully we can make it all happen
in 2017.

There will be no schedules to announce without consensus and collaboration.

I haven't seen anything public since the BoF, and I'm a little
surprised that things are happening and not being shared in the
mailing list.

This was an effort that multiple people put together on the mailing
list. I don't think it should move offline at the most crucial moment,
which is deciding what the repo will look like and how we are all going
to interact with it.

Downstream users said "they will all eventually pay the costs" with
whatever decision upstream takes. They didn't say it was going to be
cheap (most of them said it was not), nor did they say that they'll
abide by whatever decision a closed group had formed.

This is an upstream process and needs to happen upstream, which means
the mailing list. Not socials, not IRC. This needs record, and the
mailing list is the only channel that has that feature and reaches all
developers.

cheers,
--renato

Hi,

So the Github-Importer screws up the old dark places of the SVN-History and the scripts try to handle this?

I don’t think we tried the Github-Importer, but keep in mind that the project structure is a bit special. In SVN we have:

  • llvm/trunk
  • llvm/branch
  • llvm/tags
  • clang/trunk
  • clang/branch
  • clang/tags

And we want to map llvm/trunk to llvm/, clang/trunk to clang/, … in the destination repo.

We also want to map author names/emails nicely to commits, possibly catching “Patch by:” in the commit message to attribute authorship correctly (while keeping the “committer” field in git).

Also there are some specific cases in the history we want to filter: for instance all of LLVM was committed into LLDB as a zip file at some point. This is already filtered out of the git repo on llvm.org, and we want to filter this out in the monorepo.
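To make the two mappings above concrete, here is a minimal sketch. It is not the actual conversion script (that lives in the llvm-git-migration repository); the function names, the regular expression, and the assumption that a “Patch by:” credit sits on a single line are all hypothetical, and paths outside trunk are simply skipped.

```python
import re

# Hypothetical pattern: a "Patch by" credit on one line of the commit message.
PATCH_BY = re.compile(r"^\s*Patch by:?[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def map_svn_path(svn_path):
    """Map '<project>/trunk/<rest>' to '<project>/<rest>'.

    Returns None for paths outside trunk (branches, tags), which this
    sketch does not attempt to handle.
    """
    parts = svn_path.strip("/").split("/")
    if len(parts) < 2 or parts[1] != "trunk":
        return None
    return "/".join([parts[0]] + parts[2:])


def split_author(committer, message):
    """Return (author, committer) for the resulting git commit.

    The author defaults to the committer and is replaced by the
    'Patch by:' credit when the message contains one.
    """
    match = PATCH_BY.search(message)
    author = match.group(1).strip() if match else committer
    return author, committer
```

For example, `map_svn_path("llvm/trunk/lib/IR/Value.cpp")` gives `"llvm/lib/IR/Value.cpp"`. A real conversion would also need a table from SVN usernames to names/emails, which is left out here.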

Other craziness can be found by looking at the script James wrote: https://github.com/jyknight/llvm-git-migration/blob/master/llvm-svn2git.rules :

  • “Skip the revisions that deleted cfe/cfe/ and moved it to cfe/”
  • “Handle compiler-rt’s initial revision, which was out of trunk/”
  • “Ignore move from gcc-plugin/ to dragonegg/, and re-addition of gcc-plugin”
  • “Some branches are at a different level.”

I’m very impressed by the amount of dedication James is putting into this, and he deserves a big thanks for this archeology work :)

Best,

Mehdi

The main outcome of the BoF at the dev meeting was that we agreed that moving to GitHub was the best way forward for LLVM (IIRC only one person in the room expressed concern about GitHub, but he said he had a personal grievance with them and nothing specific to LLVM).

The unknown that remains is: will we use a mono-repo or a multi-repo. On this aspect:

  • We got consensus at the BoF that downstream users (i.e. non-contributors) are not impacted by this choice, and we’re not gonna optimize the repository structure for them.
  • My reading of the survey is that the monorepo has a significant lead.
  • My understanding of the dynamic of the discussions and questions during the BoF is that monorepo has a significant lead, is likely to satisfy more people, and has a very small number of people concerned about it. On the other hand many people have strong feelings about the multirepo.

FWIW, I have spoken to a large number of people about the mono vs multi-repo tradeoffs, and I’m personally convinced that mono repo is the way to go. For a few reasons:

  • Monorepo is the “natural” way to use git. Submodules are possible to use, but add significant complexity.
  • The download size of a mono-repo is manageable, and seems scalable for a project the size of LLVM (including reasonable growth over the next 10 years).
  • As Mehdi says, according to surveys and discussions in forums like the LLVM Dev Meeting BoF, most people who care are in favor of mono-repo.
  • The people most impacted by mono-repo are those who want to build just compiler-rt. We want these people to be happy, but they are very few in number, and their benefit needs to be balanced against the benefit for the larger community that builds llvm (and typically clang or another front end).

Overall, it seems clear that either approach could work, but mono seems to win out because it is more popular and simpler. It would require tweaks to LLVM’s cmake system though: instead of deciding to build a subproject based on whether it is checked out, the decision should be based on configuration-time flags.
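As a toy model of that change (plain Python, not CMake, and not how LLVM's build system is actually written), the sketch below selects subprojects from a semicolon-separated flag value in the style of `LLVM_ENABLE_PROJECTS` rather than from whatever directories happen to exist. The function name and the error handling are assumptions made for illustration.

```python
def projects_to_build(enable_projects, checked_out):
    """Select subprojects from a semicolon-separated flag value.

    enable_projects: a value in the style of LLVM_ENABLE_PROJECTS,
        e.g. "clang;libcxx".
    checked_out: names of the directories present in the checkout.
    Raises ValueError if a requested project is not checked out.
    """
    requested = [name for name in enable_projects.split(";") if name]
    missing = [name for name in requested if name not in checked_out]
    if missing:
        raise ValueError("requested but not checked out: " + ", ".join(missing))
    # Directories that exist but were not requested are ignored.
    return requested
```

With a full monorepo checkout every directory is always present, so presence on disk stops carrying any information and the flag has to do the selecting.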

-Chris

“The people most impacted by mono-repo are those who want to build just compiler-rt”

Perhaps more accurate to say “Those who want to contribute only to compiler-rt”?

Those who only want to build compiler-rt can check it out from a slave repo that mirrors commits to compiler-rt in the mono-repo. Some infrastructure is needed for this, but it’s easy and automated.

Some infrastructure is needed for this, but it’s easy and automated

Well, that sounds like submodules to me. Hasn’t there been a discussion about it in one of the first Git posts?

- Monorepo is the “natural” way to use git. Submodules are possible to use, but add significant complexity.

Having used submodules in a couple of projects, I’ve not found them to cause more difficulty than they avoided; however, they do have an issue specifically with GitHub, which is that tarballs don’t include submodules so packages are slightly harder to construct (they must point to two releases).

- The download size of a mono-repo is manageable, and seems scalable for a project the size of LLVM (including reasonable growth over the next 10 years).

The download size of a mono-repo is fine for anyone who would be checking out LLVM today. compiler-rt and libc++ are both useful without any of the rest of LLVM and contributors to libc++ rarely check out anything more than libc++ (perhaps libc++abi) today.

- As Mehdi says, according to surveys and discussions in forums like the LLVM Dev Meeting BoF, most people who care are in favor of mono-repo.

From the online surveys, I think the split was roughly 50:50. I’d be very hesitant to regard anything at a BoF as representative of the wider community, as the set of people who have the time and funding to attend a conference is quite distinct from the wider community (particularly for the US DevMeeting, which is right in the middle of university term times). We’ve made this mistake in FreeBSD before.

- The people most impacted by mono-repo are those who want to build just compiler-rt. We want these people to be happy, but they are very few in number, and their benefit needs to be balanced against the benefit for the larger community that builds llvm (and typically clang or another front end).

I believe that the big win for the monorepo is the ability to bisect usefully. It’s currently very difficult to bisect clang, because you can’t bisect clang and llvm independently (LLVM API changes frequently break clang) and they’re in different git repos (or non-enclosing svn subtrees) and so it needs some manual intervention. Having them in the same repo would ensure that they are in sync and make bisecting trivial.
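The bisecting argument can be illustrated with a small sketch of the search that `git bisect` automates. It assumes a linear list of revisions, ordered oldest to newest, where each revision identifies a matching llvm and clang (which is what a shared repository provides); the function name and the `is_bad` callback are hypothetical.

```python
def first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, the last one is bad,
    and once a revision is bad every later one is bad too.
    """
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return revisions[lo]
```

With separate repositories, `is_bad` cannot simply build "revision N": it first has to work out which clang revision goes with which llvm revision, which is the manual intervention mentioned above.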

In contrast, there is not (and should not be) tight coupling between LLVM and libc++, libunwind, libc++abi, and compiler-rt. There *may* be ordering requirements (e.g. revision X of libc++ requires c++17 features of revision Y of clang for c++17 features to work), but it is incredibly valuable to bisect these independently to find whether a particular change is a new compiler bug, a new library bug, or an old library bug that is triggered by new compiler behaviour (or an old compiler bug that is triggered by new code).

I would be in favour of a monorepo for everything that links against LLVM libraries and everything else being in separate repos.

Overall, it seems clear that either approach could work, but mono seems to win out because it is more popular and simpler. It would require tweaks to LLVM’s cmake system though: instead of deciding to build a subproject based on whether it is checked out, the decision should be based on configuration-time flags.

I believe that most of this works already - you can opt out of building components that are checked out.

David

No. Utterly different to submodules.

Submodules mean that every component is in its own repo, and someone who wants to use all the components has to check out the master repo and then regularly update not only the master repo but all the submodules. Then, worse, if they want to commit a change that touches several submodules, they have to do individual commits to each submodule, update their master module checkout to incorporate those commits, and then make a commit of the new master module state.

It’s an awful workflow compared to making your changes that touch whatever is needed, and then making one commit.

What I’m talking about is akin to the current situation, with a master svn repo, and automated processes that copy each commit there to both an all-in-one git repo and to a git repo just for that submodule. These mirror git repos are read-only to everything except the mirroring process.

The only difference is that the all-in-one git repo will become the master repo – there will be no svn repo.
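A rough sketch of the first step such a mirroring process would need: given the paths touched by one all-in-one commit, work out which per-component mirrors receive a commit and with which paths. This is an illustration only; the function name is hypothetical, and it assumes each component lives in its own top-level directory.

```python
def split_commit(changed_paths):
    """Group one commit's paths by top-level component directory.

    Returns a dict mapping component name to the paths relative to
    that component's root, i.e. as they would appear in its mirror.
    """
    per_component = {}
    for path in changed_paths:
        component, _, rest = path.partition("/")
        if rest:  # files at the repository root belong to no mirror here
            per_component.setdefault(component, []).append(rest)
    return per_component
```

A commit touching both `llvm/` and `clang/` would then show up as one commit in each of the two read-only mirrors, while remaining a single commit in the all-in-one repo.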

  • Monorepo is the “natural” way to use git. Submodules are possible to use, but add significant complexity.

Having used submodules in a couple of projects, I’ve not found them to cause more difficulty than they avoided; however, they do have an issue specifically with GitHub, which is that tarballs don’t include submodules so packages are slightly harder to construct (they must point to two releases).

Another one is the future possibility of pull requests, which are annoying to handle across repositories.

  • The download size of a mono-repo is manageable, and seems scalable for a project the size of LLVM (including reasonable growth over the next 10 years).

The download size of a mono-repo is fine for anyone who would be checking out LLVM today. compiler-rt and libc++ are both useful without any of the rest of LLVM and contributors to libc++ rarely check out anything more than libc++ (perhaps libc++abi) today.

  • As Mehdi says, according to surveys and discussions in forums like the LLVM Dev Meeting BoF, most people who care are in favor of mono-repo.

From the online surveys, I think the split was roughly 50:50.

I don’t know what data you’re basing this on. I looked very closely, and here are two questions that contradict your view.

Question: “If we could go back in time and restart the project with today’s technologies, which repository scheme would be best for the LLVM project?”
→ 55 to 36 in favor of the mono-repo

Question: "Assuming mono-repo gets adopted, how do you plan to contribute?”
→ Before the BoF, only 11% said they would continue to use split repositories (through git-svn, as today); 86% said they would use the mono-repo.

I’d be very hesitant to regard anything at a BoF as representative of the wider community, as the set of people who have the time and funding to attend a conference is quite distinct from the wider community (particularly for the US DevMeeting, which is right in the middle of university term times). We’ve made this mistake in FreeBSD before.

I believe it is representative enough of the people contributing to LLVM who have an opinion on the question. Obviously not all of them can be in the same room, but with over 400 attendees, the conference seems like a valid signal to me, especially since many of the people who have been very active on this issue were in the room.

  • The people most impacted by mono-repo are those who want to build just compiler-rt. We want these people to be happy, but they are very few in number, and their benefit needs to be balanced against the benefit for the larger community that builds llvm (and typically clang or another front end).

I believe that the big win for the monorepo is the ability to bisect usefully. It’s currently very difficult to bisect clang, because you can’t bisect clang and llvm independently (LLVM API changes frequently break clang) and they’re in different git repos (or non-enclosing svn subtrees) and so it needs some manual intervention. Having them in the same repo would ensure that they are in sync and make bisecting trivial.

In contrast, there is not (and should not be) tight coupling between LLVM and libc++, libunwind, libc++abi, and compiler-rt. There may be ordering requirements (e.g. revision X of libc++ requires c++17 features of revision Y of clang for c++17 features to work), but it is incredibly valuable to bisect these independently to find whether a particular change is a new compiler bug, a new library bug, or an old library bug that is triggered by new compiler behaviour (or an old compiler bug that is triggered by new code).

I would be in favour of a monorepo for everything that links against LLVM libraries and everything else being in separate repos.

Overall, it seems clear that either approach could work, but mono seems to win out because it is more popular and simpler. It would require tweaks to LLVM’s cmake system though: instead of deciding to build a subproject based on whether it is checked out, the decision should be based on configuration-time flags.

I believe that most of this works already - you can opt out of building components that are checked out.

Actually you have to opt-in instead of opt-out right now, and I encourage you to try it if you’re contributing to LLVM: http://llvm.org/docs/GettingStarted.html#for-developers-to-work-with-a-git-monorepo

I just gave it a try following the steps for a combination of multiple projects. Initially, it failed to build because it couldn't find llvm-project/libcxx-abi. I symlinked libcxxabi to libcxx-abi, deleted the build directory and tried again. It fails to build, but this time with a bunch of undefined references to __cxa_XXX symbols. I didn't investigate further.

Should this have just worked out of the box?

Thank you,

Steve

Yes. Can you post your cmake invocation? I’ll investigate.

Thanks.

Sorry, I should have done that initially. I copied it from the website:

cmake -GNinja ../llvm-project/llvm -DLLVM_ENABLE_PROJECTS="clang;libcxx;compiler-rt"

It builds for me right now on OSX (running `ninja check-all` right now). Can you `git pull`, try again from a clean build dir, and send me the trace (including the hash you’re on and your OS / platform)?

Thanks,

Mehdi

I did a git pull which updated me to f53d70f759a0b87048576bc87729b368a2d767a9, deleted the build directory (and the libcxx-abi symlink I created), and reran cmake and ninja and I'm not getting any errors involving missing libcxx-abi. I wish I'd saved the error log. I'm really confused by this since I can't find any reference to a libcxx-abi. I'm completely willing to believe user error here (although I cannot for the life of me figure out what I could have done).

That said, libcxx still does not build. I get the same undefined references to various __cxa_ functions (e.g., __cxa_allocate_exception, __cxa_throw, __cxa_free_exception) and some vtables (e.g., __cxxabiv1::__si_class_type_info).

I'm running Ubuntu 16.04.1 LTS on x86-64. I haven't tried on OS X.

[nook:~/programming/llvm/llvm-project] (master) steve$ git pull --rebase
remote: Counting objects: 12, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 12 (delta 9), reused 12 (delta 9), pack-reused 0
Unpacking objects: 100% (12/12), done.
From https://github.com/llvm-project/llvm-project
   b1f3d87..f53d70f master -> origin/master
First, rewinding head to replay your work on top of it...
Fast-forwarded master to f53d70f759a0b87048576bc87729b368a2d767a9.
[nook:~/programming/llvm/llvm-project] (master) steve$ git rev-parse HEAD
f53d70f759a0b87048576bc87729b368a2d767a9
[nook:~/programming/llvm/llvm-project] (master) steve$ cd ..
[nook:~/programming/llvm] steve$ mkdir clang-build
[nook:~/programming/llvm] steve$ cd clang-build
[nook:~/programming/llvm/clang-build] steve$ cmake -GNinja ../llvm-project/llvm -DLLVM_ENABLE_PROJECTS="clang;libcxx;compiler-rt"
<snip>
[nook:~/programming/llvm/clang-build] steve$ ninja
<snip failed build>
[nook:~/programming/llvm/clang-build] steve$ uname -a
Linux nook 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[nook:~/programming/llvm/clang-build] steve$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial