[RFC] One or many git repositories?

As someone that has worked with both gcc and llvm, one thing about gcc that drives me bat-guano-crazy is that first you check out gcc, try to build it, and find that you also need mpc, so you check that out and try to build it, and find out you also need gmp, so you check that out and try to build it, and find out that you also need mpfr, ……

IE I’m in favor of a mono-repository.

Also I’ll ask: for a change that spans multiple projects (eg clang, llvm, lto) simultaneously, is there any alternative to a mono-repository that supports an atomic commit? (IMHO maintaining repo integrity and consistency is the number one priority)

–Peter Lawrence.

Here is the draft: https://reviews.llvm.org/D24167

I can understand your frustration, but these are all completely external libraries, and it does not really make sense to include this into any mono-repository.

For example, you are also dependent on libc headers, are you going to include these into your repository? And kernel headers? The end result will possibly include half of GitHub into that mono-repository... :slight_smile:

So as usual, for any open source project, read the requirements first, and install those from whatever your local package management system is.

If LLVM is going to use a mono-repository, it should only include LLVM components, in my opinion.

-Dimitry

This seems a good principle in general, but note that we already include external projects in the repo. Off the top of my head I can think of the Google Test library and ISL.

+1 for keeping it separate.

One can easily set up a git subproject structure if the need is pressing...

Patrice

The bad current state of things should not be an excuse to make it worse,
especially now that we're finally close to being able to build the various LLVM
projects separately.

+1 for keeping it separate.

Can you clarify what you are referring to specifically?
This sub thread (the last 4 messages) started with a mention of GCC dependencies. It is not clear to me how it relates to llvm now.

Mehdi

Someone mentioned llvm in a mono repository below…

Right, we actually have a proposal to take what is in the current SVN repo here: http://llvm.org/svn/llvm-project/ and migrate this to a single repository.
I was not sure if you were referring to this proposal (monorepo) or to the recent emails about “external libraries” that GCC uses like gmp and mpfr.

You can find more details here: https://reviews.llvm.org/D24167

If you have some good reasons why you would think a proposal would be problematic to you, or one would better fit your workflow, feel free to expose them now.

Best,

Ok Mehdi.

I’m obviously coming into this pretty late, but I’m writing from a totally neutral user perspective, having
recently rebuilt llvm and its components on macOS Sierra. I think I am interpreting mono versus
multi repo correctly as “single repository for all llvm components/modules” vs. “multiple
(separate) repositories”. So here’s a humble perspective from an outsider: while building
all of the components separately was a (very) slight inconvenience, adding a few more terminal
commands to my workflow, it did force me to RTFM and digest the project’s architecture and
framework. In the end, I have a much better understanding of how all this connects.
Again, the impact on day to day development and maintenance is best left to you guys and
gals, but retaining a logical delineation between multiple modules works for me.

Thanks for sharing your thoughts and experience!

Has the move to native git/github been formally approved yet? I use Sourcetree, moving away from git svn would be a plus…

No, we’re still formalizing the proposals, then a survey will be sent out, then we’ll probably discuss this at the next llvm-dev meeting, then I don’t know :slight_smile:

Mehdi Amini via llvm-dev <llvm-dev@lists.llvm.org> writes:

Right, we actually have a proposal to take what is in the current SVN
repo here: http://llvm.org/svn/llvm-project/ and migrate this to a
single repository.
I was not sure if you were referring to this proposal (monorepo) or to
the recent emails about “external libraries” that GCC uses like gmp
and mpfr.

You can find more details here: https://reviews.llvm.org/D24167

If you have some good reasons why you would think a proposal would be
problematic to you, or one would better fit your workflow, feel free
to expose them now.

It could be problematic for us depending on how the monorepository is
structured. We reference the LLVM git repository directly and use it to
migrate to new versions, pick patches, etc. If LLVM proper were part of
a larger repository that becomes more difficult to do because the commit
file paths won't match. We'd be back to essentially manual diff+patch
which is quite a step backward from the smooth git-oriented process we
use now.

The document says that the individual git repositories will remain.
Does that mean the monorepository is using git-submodule to manage the
aggregate repository? If so that should work for us. I'm more
concerned about the case where the individual repositories' histories
were interwoven into a single repository and the individual repositories
went away.

I have extensive experience transitioning a very large project from a
set of individual repositories to a single repository where we interwove
the individual histories. It was the right direction for us but I don't
think it would be for LLVM.

I completely understand the benefits of a monorepository. One of the
biggest for us was the ability to git-bisect across components. How
does git-bisect work with submodules? I have very little experience
with submodules but would like to learn more.

                              -David

Hi,

Mehdi Amini via llvm-dev <llvm-dev@lists.llvm.org> writes:

Right, we actually have a proposal to take what is in the current SVN
repo here: http://llvm.org/svn/llvm-project/ and migrate this to a
single repository.
I was not sure if you were referring to this proposal (monorepo) or to
the recent emails about “external libraries” that GCC uses like gmp
and mpfr.

You can find more details here: https://reviews.llvm.org/D24167

If you have some good reasons why you would think a proposal would be
problematic to you, or one would better fit your workflow, feel free
to expose them now.

It could be problematic for us depending on how the monorepository is
structured. We reference the LLVM git repository directly and use it to
migrate to new versions, pick patches, etc. If LLVM proper were part of
a larger repository that becomes more difficult to do because the commit
file paths won’t match. We’d be back to essentially manual diff+patch
which is quite a step backward from the smooth git-oriented process we
use now.

Can you clarify what you mean? Which part of the process would require manual patching that wouldn’t otherwise?

The document says that the individual git repositories will remain.
Does that mean the monorepository is using git-submodule to manage the
aggregate repository?

First, have you read this document: https://reviews.llvm.org/D24167 ?

TLDR: The answer is no: you have to see it as it is today, i.e. a single SVN repo containing all the sub-projects, and “exports” in individual repositories.
The same thing after: a single git repo containing all the subprojects side-by-side and the same “exports” in individual repositories.

If so that should work for us. I’m more
concerned about the case where the individual repositories’ histories
were interwoven into a single repository and the individual repositories
went away.

I have extensive experience transitioning a very large project from a
set of individual repositories to a single repository where we interwove
the individual histories. It was the right direction for us but I don’t
think it would be for LLVM.

I completely understand the benefits of a monorepository. One of the
biggest for us was the ability to git-bisect across components. How
does git-bisect work with submodules? I have very little experience
with submodules but would like to learn more.

Fairly easy, the document mentions it in the examples.

Mehdi Amini <mehdi.amini@apple.com> writes:

    It could be problematic for us depending on how the monorepository
    is structured. We reference the LLVM git repository directly and
    use it to migrate to new versions, pick patches, etc. If LLVM
    proper were part of a larger repository that becomes more
    difficult to do because the commit file paths won't match. We'd be
    back to essentially manual diff+patch which is quite a step
    backward from the smooth git-oriented process we use now.

Can you clarify what you mean? Which part of the process would require
manual patching that wouldn’t otherwise?

If the monorepository is not using submodules but is instead a weaving
of the histories of each component, that means each tree item pointing
to a blob will have a different path. For example,
lib/Target/X86/X86InstrInfo.cpp would become
llvm/lib/Target/X86/X86InstrInfo.cpp or something similar. IME git
doesn't deal well with applying changes to blobs that exist in different
paths in the repository. That makes sense since the hashes directly
depend on the information in the trees.
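For what it's worth, a path prefix by itself may not force a fall back to manual diff+patch. Here is a minimal sketch, assuming a hypothetical monorepo that keeps a file under llvm/ and a standalone repository that keeps the same file at the top level (the repository names, the file and the commit message are all made up for illustration). It exports a commit with `git format-patch` and applies it with `git am -p2`, which strips one more leading path component than the default:

```shell
#!/bin/sh
set -eu

# Throwaway identity so the commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Hypothetical monorepo: the file lives under llvm/.
git init -q "$work/mono"
mkdir -p "$work/mono/llvm/lib"
echo "int x = 1;" > "$work/mono/llvm/lib/X.cpp"
git -C "$work/mono" add .
git -C "$work/mono" commit -q -m "Initial import"

# Hypothetical standalone repository: same file, no llvm/ prefix.
git init -q "$work/standalone"
mkdir -p "$work/standalone/lib"
echo "int x = 1;" > "$work/standalone/lib/X.cpp"
git -C "$work/standalone" add .
git -C "$work/standalone" commit -q -m "Initial import"

# A change made in the monorepo.
echo "int x = 2;" > "$work/mono/llvm/lib/X.cpp"
git -C "$work/mono" commit -q -am "Bump x"

# Export the change as a patch, then apply it with two leading path
# components stripped (a/ and llvm/) instead of the default one.
git -C "$work/mono" format-patch -q -1 -o "$work/patches"
git -C "$work/standalone" am -q -p2 "$work"/patches/*.patch

ported_subject=$(git -C "$work/standalone" log -1 --format=%s)
ported_content=$(cat "$work/standalone/lib/X.cpp")
echo "$ported_subject: $ported_content"
```

This only covers the simple case where a commit touches a single subproject. A commit that spans several subprojects would still need its unrelated hunks filtered out first.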

    The document says that the individual git repositories will
    remain. Does that mean the monorepository is using git-submodule
    to manage the aggregate repository?

First, have you read this document: https://reviews.llvm.org/D24167 ?

Yes, though I was only able to figure out how to see an actual document
by clicking "download raw diff." I'm not sure that's giving me the
latest version. Is there another convenient way to view the document,
preferably with the Markdown rendered?

It's not completely clear to me how the monorepository would be created,
and thus, how it would be structured. I understand each component gets
its own subdirectory. I'm talking about how the underlying history is
represented.

TLDR: The answer is no: you have to see it as it is today, i.e. a
single SVN repo containing all the sub-projects, and “exports” in
individual repositories.

So the SVN version isn't using externals? I haven't ever looked at that
repository. I didn't even know it existed until reading the document.

The same thing after: a single git repo containing all the subprojects
side-by-side and the *same* “exports” in individual repositories.

How are those exports managed? Do you use a tool to filter the history
for a directory in the monorepository and then export that to its own
repository?

    I completely understand the benefits of a monorepository. One of
    the biggest for us was the ability to git-bisect across
    components. How does git-bisect work with submodules? I have very
    little experience with submodules but would like to learn more.
    
Fairly easy, the document mentions it in the examples.

Ok, I probably skimmed that part since it wasn't directly related to
describing how the repository would be structured. I'll go back and
read it in more detail.

Thanks!

                              -David

Sure, what about a PDF?

GitHubMove.pdf (217 KB)

Mehdi Amini via llvm-dev <llvm-dev@lists.llvm.org> writes:

First, have you read this document: https://reviews.llvm.org/D24167 ?

TLDR: The answer is no: you have to see it as it is today, i.e. a
single SVN repo containing all the sub-projects, and “exports” in
individual repositories.

The same thing after: a single git repo containing all the subprojects
side-by-side and the *same* “exports” in individual repositories.

Sorry, I sent my earlier reply today before I intended to.

After going back and reading the proposal again, I think I understand
the plan. I haven't used the SVN repository for years so I was thinking
in terms of git, that you'd take the existing git mirrors and combine
them (via submodule or some other mechanism). I understand now the
proposal is to take the SVN root and export all of that as one giant git
repository. Is that correct?

If so, that raises a number of questions for me that aren't directly
addressed in the document as far as I can see:

1. How are the individual component git mirrors going to be maintained?
   
If a commit goes to the monorepository, what is going to extract the
relevant bits and commit them to the individual mirrors? The document
notes that with a monorepository a single commit can touch multiple
projects (that's good!) but something has to extract the parts of that
commit that are relevant to each subproject and then send those parts to
the subproject repository. There are tools to do this and I think
git-subtree is a good candidate [disclosure: I am the git-subtree
maintainer] but I'm just curious what's being considered as a solution.

2. Is there any consideration for restructuring the directory layout?

The document has this to say about checking out multiple components:

**Monorepo Proposal**

The repository contains natively the source for every sub-projects at the right
revision, which makes this straightforward::

  git clone https://github.com/llvm/llvm-projects.git llvm
  cd llvm
  git checkout $REVISION

As before, at this point clang, llvm, and libcxx are stored in directories
alongside each other.

The problem here is that for the build, clang wants to be in llvm/tools
and other components want to be in other places. Should the
monorepository just be structured to have everything in its correct
place for building? My inclination is to say "no" because it reduces
the visibility of the subprojects, but what are the alternatives? There
are two that come to mind off the top of my head, 1) include symlinks in
the repository or 2) change the build so all components can live at the
top level.

I think it's important to think about these kinds of questions because
once a repository layout has been settled on, it's hard to change. Yes,
it is relatively easy to move entire directories to new places in git,
but that not only would require changes to whatever entity updates the
subproject repositories, it's potentially a huge social issue, which are
typically the most difficult problems to address. :slight_smile:

3. How are the subproject repositories going to be created/migrated?

The individual subproject repositories will have to be created from
scratch after the monorepository is created, right? We can't just
transition the existing git mirrors to the new setup, correct? A
subproject repository reboot would involve some not insignificant pain
for downstream users because their git histories are suddenly invalid.
They would have to fetch a completely different repository and integrate
it into whatever they have.

If there is some way to maintain the existing git mirrors and layer new
monorepository commits on top of the existing history that would be
fantastic. I believe it is technically possible (I might need to add
some enhancements to git-subtree :)) but I don't know if anyone has
explored this. I would love to be told you all have the answers
already. :slight_smile:

Bisecting

For the multirepository proposal, the document talks about having the
git-bisect run script update each submodule during bisection. I suppose
that will work but the bisection would only report that the failure
exists at a particular commit in the umbrella repository, implying a
bunch of different commits, one for each subproject. It wouldn't really
point to a particular subproject as being the culprit, correct? The
document even hints at this: "it is possible that one commit in the
umbrella repository includes multiple commits in the sub-projects"

That's what I was getting at with my submodule bisect question. It can
only bisect to a granularity of "one of these subprojects at their
respective commits caused the problem." With a true monorepository
bisect can drill down to the exact commit within a subproject or across
multiple subprojects if the commit touched multiple subprojects. To me
this is a giant advantage of a non-submodule-based monorepository, which
I think is what the monorepository proposal is.
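To make the granularity point concrete, here is a minimal sketch of bisecting a toy monorepo. Everything in it is an assumption for illustration: the llvm/status and clang/version files, the commit messages, and the `grep` check standing in for a real build-and-test script. Commit 4 touches both subprojects at once, and `git bisect run` lands on exactly that commit:

```shell
#!/bin/sh
set -eu

# Throwaway identity so the commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/mono"
cd "$work/mono"

# Two hypothetical subprojects living side by side in one repository.
mkdir llvm clang
echo pass > llvm/status
echo 0 > clang/version
git add .
git commit -q -m "Initial import"
git tag good-start

# Six further commits; commit 4 changes clang and breaks llvm in one go.
for i in 1 2 3 4 5 6; do
  echo "$i" > clang/version
  if [ "$i" -eq 4 ]; then
    echo fail > llvm/status
  fi
  git commit -q -am "Commit $i"
done

# Bisect between the known-bad tip and the known-good tag. The test
# command exits 0 (good) while llvm/status still says "pass".
git bisect start HEAD good-start >/dev/null
git bisect run grep -q pass llvm/status >/dev/null

# refs/bisect/bad now names the first bad commit.
culprit=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad commit: $culprit"
```

With an umbrella repository of submodules, the same procedure would stop at an umbrella commit, which may bundle several subproject commits.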

If everything I've written here is generally correct, I think the
monorepository will work for us, as long as each subproject repository
is maintained at a granularity of one subproject commit per commit to
the corresponding directory in the monorepository (i.e. full history is
maintained).

Thanks for your work on this. This kind of work is crucially important
but often unrecognized and underappreciated.

                                 -David

Mehdi Amini via llvm-dev <llvm-dev@lists.llvm.org> writes:

First, have you read this document: https://reviews.llvm.org/D24167 ?

TLDR: The answer is no: you have to see it as it is today, i.e. a
single SVN repo containing all the sub-projects, and “exports” in
individual repositories.

The same thing after: a single git repo containing all the subprojects
side-by-side and the same “exports” in individual repositories.

Sorry, I sent my earlier reply today before I intended to.

After going back and reading the proposal again, I think I understand
the plan. I haven’t used the SVN repository for years so I was thinking
in terms of git, that you’d take the existing git mirrors and combine
them (via submodule or some other mechanism). I understand now the
proposal is to take the SVN root and export all of that as one giant git
repository. Is that correct?

Yes

If so, that raises a number of questions for me that aren’t directly
addressed in the document as far as I can see:

  1. How are the individual component git mirrors going to be maintained?

Just exactly as they are today.

If a commit goes to the monorepository, what is going to extract the
relevant bits and commit them to the individual mirrors? The document
notes that with a monorepository a single commit can touch multiple
projects (that’s good!) but something has to extract the parts of that
commit that are relevant to each subproject and then send those parts to
the subproject repository.

Right, but note that this is already the case today: some people already use SVN to commit to clang and LLVM at the same time, and that single SVN commit results in one commit in the llvm git repo and another commit in the clang repo.

There are tools to do this and I think
git-subtree is a good candidate [disclosure: I am the git-subtree
maintainer] but I’m just curious what’s being considered as a solution.

Well we haven’t decided on anything for the official mirrors. It looks like you’re in a good position to help design how subtree could help here :slight_smile:
(I have a fairly good understanding of git, but very limited knowledge of subtree)
Anyway I hope we will be able to put scripts in the repo so that anyone downstream can split the repo independently of official mirrors.

  2. Is there any consideration for restructuring the directory layout?

The document has this to say about checking out multiple components:

Monorepo Proposal

The repository contains natively the source for every sub-projects at the right
revision, which makes this straightforward::

  git clone https://github.com/llvm/llvm-projects.git llvm
  cd llvm
  git checkout $REVISION

As before, at this point clang, llvm, and libcxx are stored in directories
alongside each other.

The problem here is that for the build, clang wants to be in llvm/tools
and other components want to be in other places.

Not exactly: cmake has magic discovery when clang is in tools, but it is not a requirement. You have been able to do this for years: `cmake -DLLVM_EXTERNAL_CLANG_SOURCE_DIR=path`

Should the
monorepository just be structured to have everything in its correct
place for building? My inclination is to say “no” because it reduces
the visibility of the subprojects, but what are the alternatives? There
are two that come to mind off the top of my head, 1) include symlinks in
the repository or 2) change the build so all components can live at the
top level.

I’d expect a cmake shortcut: `cmake -DLLVM_ENABLE_PROJECTS=clang,libcxx,compiler-rt`

I think it’s important to think about these kinds of questions because
once a repository layout has been settled on, it’s hard to change. Yes,
it is relatively easy to move entire directories to new places in git,
but that not only would require changes to whatever entity updates the
subproject repositories, it’s potentially a huge social issue, which are
typically the most difficult problems to address. :slight_smile:

  3. How are the subproject repositories going to be created/migrated?

The individual subproject repositories will have to be created from
scratch after the monorepository is created, right? We can’t just
transition the existing git mirrors to the new setup, correct?

It depends: there are tradeoffs for each option and I think we need to gather community input to settle on one.

A
subproject repository reboot would involve some not insignificant pain
for downstream users because their git histories are suddenly invalid.
They would have to fetch a completely different repository and integrate
it into whatever they have.

If we “reboot” the official git mirrors, I expect we’d provide scripts for integrating from the new monorepo on top of the existing history.

Ultimately these mirrors are “facilities” but it shouldn’t be significantly harder for downstream to integrate directly from the monorepo with a bit of scripting, and I suspect this scripting is likely to be shareable and committed upstream.

If there is some way to maintain the existing git mirrors and layer new
monorepository commits on top of the existing history that would be
fantastic. I believe it is technically possible (I might need to add
some enhancements to git-subtree :)) but I don’t know if anyone has
explored this. I would love to be told you all have the answers
already. :slight_smile:

Bisecting

For the multirepository proposal, the document talks about having the
git-bisect run script update each submodule during bisection. I suppose
that will work but the bisection would only report that the failure
exists at a particular commit in the umbrella repository, implying a
bunch of different commits, one for each subproject. It wouldn’t really
point to a particular subproject as being the culprit, correct?

Yes, it depends on the frequency of the update of the umbrella.

The
document even hints at this: “it is possible that one commit in the
umbrella repository includes multiple commits in the sub-projects”

That’s what I was getting at with my submodule bisect question. It can
only bisect to a granularity of “one of these subprojects at their
respective commits caused the problem.” With a true monorepository
bisect can drill down to the exact commit within a subproject or across
multiple subprojects if the commit touched multiple subprojects. To me
this is a giant advantage of a non-submodule-based monorepository, which
I think is what the monorepository proposal is.

If everything I’ve written here is generally correct, I think the
monorepository will work for us, as long as each subproject repository
is maintained at a granularity of one subproject commit per commit to
the corresponding directory in the monorepository (i.e. full history is
maintained).

Thanks for your work on this. This kind of work is crucially important
but often unrecognized and underappreciated.

Thanks :slight_smile:

If you have any input on parts of the document that can be made more clear, feel free to chime in in the review.

Mehdi Amini <mehdi.amini@apple.com> writes:

    Yes, though I was only able to figure out how to see an actual
    document by clicking "download raw diff." I'm not sure that's
    giving me the latest version. Is there another convenient way to
    view the document, preferably with the Markdown rendered?
    
Sure, what about a PDF?

Thanks! Is this auto-generated somewhere as the document is updated?

What I am referring to here is: http://llvm.org/svn/llvm-project/
The SVN repo is a monorepo where the histories of the subprojects are
“woven” together. And we are still able to export to individual git
repositories.

Yes, I later realized what you are saying here. I sent a follow-up
e-mail with some additional questions.

    How are those exports managed? Do you use a tool to filter the
    history for a directory in the monorepository and then export that
    to its own repository?
    
Yes.

There are multiple ways to do that actually.
Conceptually, you can think about it as using `git diff` and
`patch -p1` to take every commit to the monorepo and reapply them on the
individual repo.
The easiest way to achieve it though is probably the facility embedded
in git itself: `git filter-branch --subdirectory-filter llvm`
Git - git-filter-branch Documentation

Ok, that's what I assumed you would do (filter-branch). Again, the
follow-up e-mail has some additional questions about this.
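As a rough illustration of the filter-branch approach, here is a minimal sketch on a toy repository. The llvm/ and clang/ directories, file names and commit messages are invented for the example, and this is not necessarily how the official mirrors would be produced. It shows that a commit touching both subprojects survives in the export with only its llvm/ part, while a clang-only commit drops out:

```shell
#!/bin/sh
set -eu

# Throwaway identity so the commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git versions print a deprecation notice for filter-branch and pause.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)

# Hypothetical monorepo with two subprojects side by side.
git init -q "$work/mono"
cd "$work/mono"
mkdir llvm clang
echo a > llvm/a.txt
git add .
git commit -q -m "llvm: add a"
echo b > clang/b.txt
git add .
git commit -q -m "clang: add b"
echo a2 >> llvm/a.txt
echo b2 >> clang/b.txt
git commit -q -am "both: touch a and b"

# Export llvm/ into its own repository: work on a clone and rewrite its
# history so that llvm/ becomes the project root.
git clone -q "$work/mono" "$work/llvm-export"
cd "$work/llvm-export"
git filter-branch --subdirectory-filter llvm HEAD >/dev/null 2>&1

export_commits=$(git rev-list --count HEAD)
export_files=$(git ls-files)
export_tip=$(git log -1 --format=%s)
echo "$export_commits commits, files: $export_files, tip: $export_tip"
```

Note that the rewritten commits get new hashes, so an export produced this way does not share history with an existing mirror unless extra work is done to line them up.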

Also, since GitHub offers SVN access, you can view the monorepo as
offering the same SVN access as we have today. So the individual git
repository can also just be `git svn` on a subdirectory of the SVN
view of the monorepo on GitHub (I’m not sure this sentence is totally
clear).

Oh, that's kind of cool! Are lots of people going to continue using
SVN? I've been assuming that folks would gradually transition over and
there'd come a point where we could shut down SVN support.

        I completely understand the benefits of a monorepository. One
        of the biggest for us was the ability to git-bisect across
        components. How does git-bisect work with submodules? I have
        very little experience with submodules but would like to learn
        more.
        
        Fairly easy, the document mentions it in the examples.

    Ok, I probably skimmed that part since it wasn't directly related
    to describing how the repository would be structured. I'll go back
    and read it in more detail.

Do not hesitate if anything is unclear.

I didn't. :slight_smile: I made some comments on bisect in my follow-up.

                              -David

Mehdi Amini <mehdi.amini@apple.com> writes:

    After going back and reading the proposal again, I think I
    understand the plan. I haven't used the SVN repository for years
    so I was thinking in terms of git, that you'd take the existing
    git mirrors and combine them (via submodule or some other
    mechanism). I understand now the proposal is to take the SVN root
    and export all of that as one giant git repository. Is that
    correct?

Yes

Hooray! I got it!

    If a commit goes to the monorepository, what is going to extract
    the relevant bits and commit them to the individual mirrors? The
    document notes that with a monorepository a single commit can
    touch multiple projects (that's good!) but something has to
    extract the parts of that commit that are relevant to each
    subproject and then send those parts to the subproject repository.

Right, but note that it is already the case today, some people are
already using SVN to commit to clang and LLVM at the same time

That...is an abomination. :slight_smile:

    There are tools to do this and I think
    git-subtree is a good candidate [disclosure: I am the git-subtree
    maintainer] but I'm just curious what's being considered as a
    solution.
    
Well we haven't decided on anything for the official mirrors. It looks
like you're in a good position to help design how subtree could
help here :slight_smile:
(I have a fairly good understanding of git, but very limited knowledge
of subtree)

For the subtree split process, git-subtree currently uses an arcane (and
SLOW!) algorithm that I presume was written before filter-branch was
available. I inherited the code so I don't know the full backstory. In
any event, it's buggy in some corner cases so my plan is to transition
it to filter-branch so for the most common splits it would simply be a
more user-friendly wrapper around filter-branch. I'm guessing that's
all the LLVM ecosystem would need. There are some more intricate cases
but those mostly relate to some enhancements I've made that aren't even
public yet.

Anyway I hope we will be able to put scripts in the repo so that anyone
downstream can split the repo independently of official mirrors.

That would be excellent.

    The problem here is that for the build, clang wants to be in
    llvm/tools and other components want to be in other places.

Not exactly: cmake has magic discovery when clang is in tools, but it
is not a requirement. You have been able to do this for years:
`cmake -DLLVM_EXTERNAL_CLANG_SOURCE_DIR=path`

Oh! I didn't know that. That makes certain things I do easier. :slight_smile:

Probably the clang build documents need to be updated. :slight_smile:

    Should the monorepository just be structured to have everything in
    its correct place for building? My inclination is to say "no"
    because it reduces the visibility of the subprojects, but what are
    the alternatives? There are two that come to mind off the top of
    my head, 1) include symlinks in the repository or 2) change the
    build so all components can live at the top level.

I'd expect a cmake shortcut:
`cmake -DLLVM_ENABLE_PROJECTS=clang,libcxx,compiler-rt`

Makes total sense.

    The individual subproject repositories will have to be created
    from scratch after the monorepository is created, right? We can't
    just transition the existing git mirrors to the new setup,
    correct?

It depends: there are tradeoffs for each option and I think we need to
gather community input to settle on one.

Yes. Lots of discussion is needed here.

    A subproject repository reboot would involve some not
    insignificant pain for downstream users because their git
    histories are suddenly invalid. They would have to fetch a
    completely different repository and integrate it into whatever
    they have.

If we "reboot" the official git mirrors, I expect we'd provide scripts
for integrating from the new monorepo on top of the existing history.

Interesting. If the existing history can be maintained and built upon
that would relieve a lot of burden on users.

Ultimately these mirrors are "facilities" but it shouldn't be
significantly harder for downstream to integrate directly from the
monorepo with a bit of scripting, and I suspect this scripting is
likely to be shareable and committed upstream.

I suspect you are right.

    Bisecting
    
    For the multirepository proposal, the document talks about having
    the git-bisect run script update each submodule during
    bisection. I suppose that will work but the bisection would only
    report that the failure exists at a particular commit in the
    umbrella repository, implying a bunch of different commits, one
    for each subproject. It wouldn't really point to a particular
    subproject as being the culprit, correct?

Yes, it depends on the frequency of the update of the umbrella.

I see what you mean. Yes, you are correct.

    Thanks for your work on this. This kind of work is crucially
    important but often unrecognized and underappreciated.

Thanks :slight_smile:

If you have any input on parts of the document that can be made more
clear, feel free to chime in in the review.

Will do!

                               -David

Hey, I like this idea!

In that case, we don't need the directories in any particular
location, as CMake would be able to find and link any place *we* want
to put them in (in tree, flat out) and pull out their CMake files.

This would also help each project to be built on its own, if they so
require, without upsetting the LLVM-canon build style.

cheers,
--renato