[RFC] One or many git repositories?

Dear all,

I would like to (re-)open a discussion on the following specific question:

  Assuming we are moving the llvm project to git, should we
  a) use multiple git repositories, linked together as subrepositories
of an umbrella repo, or
  b) use a single git repository for most llvm subprojects.

The current proposal assembled by Renato follows option (a), but I
think option (b) will be significantly simpler and more effective.
Moreover, I think the issues raised with option (b) are either
incorrect or can be reasonably addressed.

Specifically, my proposal is that all LLVM subprojects that are
"version-locked" (and/or use the common CMake build system) live in a
single git repository. That probably means all of the main llvm
subprojects other than the test-suite and maybe libc++. From looking
at the repository today that would be: llvm, clang, clang-tools-extra,
lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.

Let's first talk about the advantages of a single repository. Then
we'll address the disadvantages raised.

At a high level, one repository is simpler than multiple repos that
must be kept in sync using an external mechanism. The submodules
solution requires nontrivial automation to maintain the history of
commits in the umbrella repo (which we need if we want to bisect, or
even just build an old revision of clang), but no such mechanisms are
required if we have a single repo.

Similarly, it's possible to make atomic API changes across subprojects
in a single repo, which we simply can't do with the submodules proposal.
And working with llvm release branches becomes much simpler.

In addition, the single repository approach ties branches that contain
changes to subprojects (e.g. clang) to a specific version of llvm
proper. This means that when you switch between two branches that
contain changes to clang, you'll automatically check out the right
llvm bits.

Although we can do this with submodules too, a single repository makes
it much easier.

As a concrete example, suppose you are working on some changes in
clang. You want to commit the changes, then switch to a new branch
based on tip of head and make some new changes. Finally you want to
switch back to your original branch. And when you switch between
branches, you want to get an llvm that's in sync with the clang in
your working copy.

Here's how I'd do it with a monolithic git repository, option (b):

  git commit # old-branch
  git fetch
  git checkout -b new-branch origin/master
  # hack hack hack
  git commit # new-branch
  git checkout old-branch

Here's how I'd do it with option (a), submodules. I've used git -C
here to make it explicit which repo we're working in, but in real life
I'd probably use cd.

  # First, commit to two branches, one in your clang repo and one in your
  # master repo.
  git -C tools/clang commit # old-branch, clang submodule
  git commit # old-branch, master repo
  # Now fetch the submodules and check out head. Start a new branch in the
  # umbrella repo.
  git submodule foreach git fetch
  git checkout -b new-branch origin/master
  git submodule update
  # Start a new branch in the clang repo pointing to the current head.
  git -C tools/clang checkout -b new-branch
  # hack hack hack
  # Commit both branches.
  git -C tools/clang commit # new-branch
  git commit # new-branch
  # Check out the old branch.
  git checkout old-branch
  git submodule update

This is twice as many git commands, and almost three times as much
typing, to do the same thing.

Indeed, this is so complicated I expect that many developers wouldn't
bother, and would continue to develop the way we currently do. They
would thus continue to be unable to create clang branches that include
an llvm revision. :(

There are real simplifications and productivity advantages to be had
by using a single repository. They will affect essentially every
developer who makes changes to subprojects other than LLVM proper,
cares about release branches, bisects our code, or builds old
revisions.

So that's the first part, what we have to gain by using a monolithic
repository. Let's address the downsides.

If you'll bear with a hypothetical: Imagine you could somehow make the
monolithic repository behave exactly like the N separate repositories
work today. If so, that would be the best of both worlds: Those of us
who want a monolithic repository could have one, and those of us who
don't would be unaffected. Whatever downsides you were worried about
would evaporate in a mist of rainbows and puppies.

It turns out this hypothetical is very close to reality. The key is
git sparse checkouts [1], which let you check out only some files or
directories from a repository. Using this facility, if you don't like
the switch to a monolithic repository, you can set up your git so
you're (almost) entirely unaffected by it.

If you want to check out only llvm and clang, no problem. Just set up
your .git/info/sparse-checkout file appropriately. Done.

If you want to be able to have two different revisions of llvm and
clang checked out at once (maybe you want to update your clang bits
more often than you update your llvm bits), you can do that too. Make
one sparse checkout just of llvm, and make another sparse checkout
just of clang. Symlink the clang checkout to llvm/tools/clang.
That's it. The two checkouts can even share a common .git dir, so you
don't have to fetch and store everything twice.
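
A minimal sketch of the two-checkout setup described above. It assumes
git 2.5 or later and uses "git worktree" as one possible way to share a
.git dir, which is not necessarily the mechanism intended here. It runs
against a throwaway toy repository rather than the real one; all paths
and file contents are placeholders.

```shell
#!/bin/sh
set -e
demo=$(mktemp -d)

# Toy stand-in for the monolithic repository (hypothetical contents).
git init -q "$demo/mono"
cd "$demo/mono"
mkdir llvm clang lld
echo llvm > llvm/README; echo clang > clang/README; echo lld > lld/README
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial import"

# Checkout 1 (the main working tree): llvm only.
git config core.sparseCheckout true
mkdir -p .git/info
echo "/llvm" > .git/info/sparse-checkout
git read-tree -mu HEAD

# Checkout 2: a linked worktree sharing the same object store, clang only.
git worktree add --detach "$demo/clang-only" HEAD >/dev/null 2>&1
cd "$demo/clang-only"
sparse_file=$(git rev-parse --git-path info/sparse-checkout)
mkdir -p "$(dirname "$sparse_file")"
echo "/clang" > "$sparse_file"
git read-tree -mu HEAD

# Symlink the clang checkout into the llvm tree.
mkdir -p "$demo/mono/llvm/tools"
ln -s "$demo/clang-only/clang" "$demo/mono/llvm/tools/clang"

ls "$demo/mono" "$demo/clang-only"
```

Each working tree can then be moved to a different revision on its own,
while objects are fetched and stored only once.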

As far as I can tell, the only overhead of the monolithic repository
is the extra storage in .git. But this is quite small in the scheme
of things.

The .git dir for the existing monolithic repository [2] is 1.2GB. By
way of comparison, my objdir for a release build of llvm and clang is
3.5GB, and a full checkout (workdir + .git dirs) of llvm and clang is
0.65GB.

If the 1.2GB really is a problem for you (or more likely, your
automated infrastructure), a shallow clone [3] takes this down to 90MB.

The critical point to me in all this is that it's easy to set up the
monolithic repository to appear like it's a bunch of separate repos.
But it is impossible, insofar as I can tell, to do the opposite. That
is, option (b) is strictly more powerful than option (a).

Renato has understandably pointed out that the current proposal is
pretty far along, so please speak up now if you want to make this
happen. I think we can.

Regards,
-Justin

[1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
info, see http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/.
As far as I can tell, sparse checkouts work fine on Windows, but you
have to use git-bash; see the Stack Overflow question 'On Windows git:
"error: Sparse checkout leaves no entry on the working directory"'.
[2] https://github.com/llvm-project/llvm-project
[3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git

Justin Lebar via llvm-dev <llvm-dev@lists.llvm.org> writes:

> I would like to (re-)open a discussion on the following specific question:
>
>   Assuming we are moving the llvm project to git, should we
>   a) use multiple git repositories, linked together as subrepositories
>      of an umbrella repo, or
>   b) use a single git repository for most llvm subprojects.
>
> The current proposal assembled by Renato follows option (a), but I
> think option (b) will be significantly simpler and more effective.
> Moreover, I think the issues raised with option (b) are either
> incorrect or can be reasonably addressed.
>
> Specifically, my proposal is that all LLVM subprojects that are
> "version-locked" (and/or use the common CMake build system) live in a
> single git repository. That probably means all of the main llvm
> subprojects other than the test-suite and maybe libc++. From looking
> at the repository today that would be: llvm, clang, clang-tools-extra,
> lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.

FWIW, I'm opposed. I'm not convinced that the problems with multiple
repos are any worse than the problems with a single repo, which makes
this more or less just change for the sake of change, IMO.

It would be useful to know what problems you see with a single repo that are more significant. In particular, either why you think the problems jlebar already mentioned are worse than he sees them, or what other problems there are that he hasn’t addressed.

Hi Justin,

Not true. SVN can be checked out by directory, Git needs to be cloned
on the root.

Today I *can* checkout only LLVM and Clang. On a single Git repo I can't.

cheers,
--renato

+1 to everything Justin points out here (and the rest of the email, which I've snipped for brevity).

Before anything else, I've been through a few of these conversions from SVN to git in other projects. In most of the ones I've seen going to submodules of multiple repos, a lot of automation is required just to keep things manageable. That's hard to do on a cross-platform basis (do you script in Python, shell script, one per OS, etc.) and is really more trouble than it's worth -- especially when adding new submodules and/or removing them. They're not impossible to do, but they're also much more work than a single repo.

Just to point out some devil's advocate positions:

- Keeping the current structure will be less churn to existing consumers that have "out of tree" builds based on the current structure. Asking them to change their workflow with SVN significantly (since moving to GitHub is mostly swayed by the SVN interface) will probably be non-trivial amounts of work. We probably need to document this well enough or show that the switch won't affect them too badly.

- Some people value keeping the history of the commits in SVN and the Git counterpart once the move happens (for a lot of valid reasons). Making sure we can merge the histories of all the subproject repositories into a single one should be addressed to preserve "provenance".

- Some people like isolation of workflows and concerns. As a git-native convert, I'm not sold on this, but there are some good reasons to be able to do this (maintainers of certain projects will probably enforce different constraints on when/who/how changes can/should/must be made). Making it possible to do so in a monorepo should be explained well (i.e. does this need any special configs on the repo on the server side, on GitHub, etc.).

All in all I think optimising for the case of the everyday developer working on multiple projects (in my case LLVM, Clang, and compiler-rt, and maybe potentially XRay as a subproject too) is a good cause. Whether this translates to every special consumer of the current set-up is less clear at least to me -- so I'd like to know what other stakeholders here think.

Cheers

Chandler Carruth <chandlerc@google.com> writes:

> It would be useful to know what problems you see with a single repo that
> are more significant. In particular, either why you think the problems
> jlebar already mentioned are worse than he sees them, or what other
> problems there are that he hasn't addressed.

Running the same 'git checkout' commands on multiple repos has always
been sufficient to manage the multiple repos so far - as long as you
create the same branches and tags in each repo, it's easy[1] to manage
the set of repos with a script that cd's to each one and runs whatever
git command.

So it's a pretty minor inconvenience today to have the multiple repos in
the case where you want to check out all of them.
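
For illustration, a script of the kind described above might look
roughly like this. It is only a sketch: the repo list, the temporary
layout and the helper name each_repo are placeholders, and the three
repositories are empty stand-ins created on the spot instead of real
clones.

```shell
#!/bin/sh
set -e
# Throwaway stand-ins for separate llvm, clang and compiler-rt clones
# (hypothetical layout; real checkouts would already exist).
root=$(mktemp -d)
repos="llvm clang compiler-rt"
for r in $repos; do
  git init -q "$root/$r"
  git -C "$root/$r" -c user.name=demo -c user.email=demo@example.com \
      commit -q --allow-empty -m "initial commit"
done

# Run the same git command in every repo, stopping at the first failure.
each_repo() {
  for r in $repos; do
    git -C "$root/$r" "$@" || return 1
  done
}

each_repo checkout -q -b my-feature   # same branch name in every repo
branches=$(each_repo rev-parse --abbrev-ref HEAD)
echo "$branches"
```

This keeps branch names aligned across the repos; it does not by itself
record which revision of one repo goes with which revision of another.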

OTOH, if all of the repos are combined into one, you have to do work
when you only want some of them. In my experience, this is basically
always - between my various machines and projects I have several
checkouts of llvm+compiler-rt+clang+libc++, and I have a lot of
checkouts of just llvm. I've only checked out the other repos when I was
changing APIs and needed to update them.

I haven't tried the options jlebar has described to deal with these -
sparse checkouts and whatnot, but they seem like an equivalent amount of
work/learning curve as writing a script that cd's to several directories
and runs the same git command in each.

Thus, this also sounds like a minor inconvenience. I just don't see how
trading one for the other is worth doing, since AFAICT they're equally
inconvenient.

[1] My understanding of the "umbrella repo" thing for bisecting is that
    it'll be managed automatically by a cron or checkin hooks or
    whatever, so the bits in jlebar's description about updating
    submodules seem like a red herring. I'm assuming that we end up in a
    place where working with git is essentially the same as we work with
    git-svn today.

> Today I *can* checkout only LLVM and Clang. On a single Git repo I can't.

This is true if you s/checkout/clone/. With a single repo, you must
clone (download) everything (*), but after you've done so you can use
sparse checkouts to check out (create a working copy of) only llvm and
clang. So you should only notice the fact that there exist things
other than llvm and clang when you first clone (download) llvm.

Either way switching to git is going to be a change from the status
quo. Personally I'm more interested in finding the best overall
solution than the solution which is "most similar" to the current
setup under some metric.

(*) Technically, if you do a shallow clone, you have to download a
single revision of everything. That's the 90MB number from my
original post.

> Running the same ‘git checkout’ commands on multiple repos has always
> been sufficient to manage the multiple repos so far - as long as you
> create the same branches and tags in each repo, it’s easy[1] to manage
> the set of repos with a script that cd’s to each one and runs whatever
> git command.

A notable difference is the ability to do API updates across them or the ability to bisect across them.

Also, if the infrastructure that keeps the umbrella repo in sync falls over or has a serious problem, reconstructing version-locked state in order to bisect across those regions of time seems quite challenging. So IMO, it isn’t a minor inconvenience, even if it is something we could overcome.
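
To make the bisection point concrete, here is a small sketch on a toy
single repository. The file names, the version numbers and the cmp
check standing in for "build and test" are all invented for the
example; only the git commands themselves are real.

```shell
#!/bin/sh
set -e
demo=$(mktemp -d)
git init -q "$demo/mono"
cd "$demo/mono"
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

mkdir llvm clang
echo 1 > llvm/api-version
echo 1 > clang/expects-api
git add .; commit "initial import"
git tag good-rev

# An atomic cross-project change: both directories move in one commit.
echo 2 > llvm/api-version; echo 2 > clang/expects-api
git add .; commit "llvm+clang: bump API to 2"

# A regression: llvm changes but clang is not updated.
echo 3 > llvm/api-version
git add .; commit "llvm: bump API to 3"
bad=$(git rev-parse HEAD)

echo "unrelated" > clang/README
git add .; commit "clang: add README"

# Bisect between the tagged good revision and HEAD. The check passes
# only when the two projects agree on the API version.
git bisect start HEAD good-rev >/dev/null
git bisect run sh -c 'cmp -s llvm/api-version clang/expects-api' >/dev/null
found=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

[ "$found" = "$bad" ] && echo "first bad commit: $found"
```

Because every commit carries a consistent llvm and clang state, no
umbrella history is needed to reconstruct what to build at each step.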

> So it’s a pretty minor inconvenience today to have the multiple repos in
> the case where you want to check out all of them.
>
> OTOH, if all of the repos are combined into one, you have to do work
> when you only want some of them. In my experience, this is basically
> always - between my various machines and projects I have several
> checkouts of llvm+compiler-rt+clang+libc++, and I have a lot of
> checkouts of just llvm. I’ve only checked out the other repos when I was
> changing APIs and needed to update them.
>
> I haven’t tried the options jlebar has described to deal with these -
> sparse checkouts and whatnot, but they seem like an equivalent amount of
> work/learning curve as writing a script that cd’s to several directories
> and runs the same git command in each.

I actually would like to see an example of how you would checkout a common subset with the sparse checkout feature. jlebar, could you give us demo commands for this?

In particular, I’ve had a lot of folks come up and ask me for my script to walk all the directories and run the appropriate git commands in them, and if it is easier to have the GettingStarted page document how to use the sparse checkout thing, that would be nice.

> Thus, this also sounds like a minor inconvenience. I just don’t see how
> trading one for the other is worth doing, since AFAICT they’re equally
> inconvenient.

IIUC you seem to explain that there are minor inconveniences on both sides, but then I’m not sure why you are opposed? It seems pretty equal.

Also the minor inconvenience in the case of the monolithic repository is happening during the initial setup/clone/checkout, and not during day-to-day development (git pull, git checkout -b, git commit, git push), while the split model induces “minor inconveniences” in the day-to-day developer interaction.
I.e. I prefer using a script to checkout and setup the repo, and then be able to use the standard git commands for interacting with it.

> [1] My understanding of the “umbrella repo” thing for bisecting is that
> it’ll be managed automatically by a cron or checkin hooks or
> whatever,

That’s also something that is fragile to me without a deterministic way to reconstruct it identically from scratch using only the split repositories (which would be possible with “git notes” attached by a server-side hook for instance, but unfortunately github does not allow it, and the current split-repository proposal excludes even discussing the merits of other hosting services).

> so the bits in jlebar’s description about updating
> submodules seem like a red herring. I’m assuming that we end up in a
> place where working with git is essentially the same as we work with
> git-svn today.

Some people manage today to have a single commit that updates clang+llvm at the same time.
I believe doing this in the split-repository model requires write-access to the umbrella repo.

So, we use that to a certain extent.

Linaro's GCC validation uses the full checkout, then does a shallow
checkout that only has the updates.

Our LLVM scripts, OTOH, clone all repos and use worktree for *all*
branches, and we only branch on the repos that we choose, for each
"working dir".

Our scripts probably would need certain modifications... but it should be fine.

But I'm not, by far, the most problematic user.

The real problem, and why people accepted sub-modules, is that a lot
of downstream people only use one project or another. Mostly LLVM or
Clang or libc++.

Checking out all of it is bad, but having them officially interlinked,
it seems, is worse. IIUC, the problem is that the projects are now
built independently on their projects, but more and more CMake changes
are creeping in, making it harder and harder to separate their
projects from the rest of LLVM. This means they'll now depend on a
much larger body of sources that will need to be compiled together,
and will probably mean they'll abandon LLVM in favour of something
lighter.

I honestly don't know how big that problem is. I don't have it myself,
but I "can imagine" compiling LLVM and Clang without need would be
pretty bad.

cheers,
--renato

> Running the same 'git checkout' commands on multiple repos has always been sufficient to manage the multiple repos so far

Huh. It definitely hasn't worked well for me.

Here's the issue I face every day. I may be working on (unrelated)
changes to clang and llvm. I update my llvm tree (say I checked in a
patch, or I want to pull in changes someone else has checked in). Now
I want to go back to hacking on my clang stuff. Because my clang
branch is not connected to a specific LLVM revision, it no longer
compiles. I'm trying to build an old clang against a new llvm.

Now I have to pull the latest clang and rebase my patches. After I
deal with rebase conflicts (not what I wanted to do at the moment!),
I'm in a new state, which means when I build my ccache is no help.
And when I run the clang tests, I don't know whether to expect test
failures. So then I have to pop off my patches and run at head...
(Maybe I have to update clang! In which case I also have to update
llvm...)

This would all be solved with zero work on my part if llvm and clang
were in one repository. Then when I switched to working on my clang
patches, I would automatically check out a version of LLVM that is
compatible.

I think this is the main thing that people aren't getting. Maybe
because it's never been possible before to have a workflow like this.
But having a git branch that you can check out and immediately build
-- without any rebasing, re-syncing, or other messing around -- is
incredibly powerful.
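
Here is a small sketch of that workflow on a toy single repository. The
VERSION files and revision labels are made up for the example; they
just stand in for "the llvm and clang sources at some revision".

```shell
#!/bin/sh
set -e
demo=$(mktemp -d)
git init -q "$demo/mono"
cd "$demo/mono"
git symbolic-ref HEAD refs/heads/master
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

mkdir llvm clang
echo "llvm r1" > llvm/VERSION
echo "clang r1" > clang/VERSION
git add .; commit "initial import"

# A clang patch developed on top of llvm r1.
git checkout -q -b old-branch
echo "clang r1 + my patch" > clang/VERSION
git add .; commit "clang: my patch"

# Meanwhile trunk moves on in both projects.
git checkout -q master
echo "llvm r2" > llvm/VERSION
echo "clang r2" > clang/VERSION
git add .; commit "llvm+clang: r2"

# Switching back restores the llvm that the patch was written against.
git checkout -q old-branch
llvm_on_old=$(cat llvm/VERSION)
clang_on_old=$(cat clang/VERSION)
echo "$llvm_on_old / $clang_on_old"
```

One "git checkout" brings back both the patched clang and the matching
llvm, with no rebase or re-sync step in between.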

Please let me know if this is still not clear -- it's kind of the key point.

As I said, you can accomplish this with submodules, too, but it
requires the complex hackery from my original email.

To me, this is not at all a minor inconvenience. It's at least an
hour of wasted time every week.

> I haven't tried the options jlebar has described to deal with these - sparse checkouts and whatnot, but they seem like an equivalent amount of work/learning curve as writing a script that cd's to several directories and runs the same git command in each.

I'll send sparse checkout instructions separately. But my example
submodules commands are not at all equivalent to a script that cd's
into several directories and runs a git command in each, and I think
this is the main point of confusion. (In fact you wouldn't need to
write such a script; it's just "git submodule foreach".)

The submodules commands create a single branch in the umbrella repo
that encompasses the checked-out state of *all the LLVM subrepos*. So
you can, at a later time, check out this branch in the umbrella repo
and all the clang, llvm, etc. bits will be identical to the last time
you were on the branch.

If all you want is to continue using git the way you use it now, the
multiple git repos get you that (as does a sparse checkout on the
single repo). My point is that the move to git opens up a new, much
more powerful workflow with branches that encompass both llvm and
clang state. We can do this with or without submodules, but using
submodules for this is far more awkward than using a single repo.

-Justin L.

You seem to imply that all the projects in the single repo would be built by default, while it is not part of the proposal.
Actually I’d expect an opt-in mechanism, so that: `mkdir build-llvm && cd build-llvm && cmake ../llvm` only builds LLVM.

Mehdi Amini <mehdi.amini@apple.com> writes:

> IIUC you seem to explain that there are minor inconveniences on both
> sides, but then I’m not sure why you are opposed? It seems pretty
> equal.

I should clarify, this is a -0 kind of opposed. If people overwhelmingly
think this is the way to go, I won't try to block it or anything. I'd
rather not have to update a bunch of workflow, infrastructure, and bots
for no particular reason though.

> That’s also something that is fragile to me without a deterministic
> way to reconstruct it identically from scratch using only the split
> repositories (which would be possible with “git notes” attached by a
> server-side hook for instance, but unfortunately github does not allow
> it, and the current split-repository proposal excludes even
> *discussing* the merits of other hosting services).

I haven't been following that discussion, but that seems surprising
since AFAICT the only particularly compelling reason to move away from
SVN is that it's easy to find good reliable hosting.

> You seem to imply that all the projects in the single repo would be built by default, while it is not part of the proposal.
> Actually I’d expect an opt-in mechanism, so that: mkdir build-llvm && cd build-llvm && cmake ../llvm only builds LLVM.

If we end up with a single repository, I agree and think at least some level of opt-in for building subprojects is essential.

I would expect at most to automatically enable building the set of subprojects we currently suggest by default in the getting started docs. Any more than that wouldn’t make sense, and I could even imagine defaulting fewer projects at the build system level.

> This is true if you s/checkout/clone/. With a single repo, you must
> clone (download) everything (*), but after you've done so you can use
> sparse checkouts to check out (create a working copy of) only llvm and
> clang. So you should only notice the fact that there exist things
> other than llvm and clang when you first clone (download) llvm.

So, we use that to a certain extent.

Linaro's GCC validation uses the full checkout, then does a shallow
checkout that only has the updates.

Our LLVM scripts, OTOH, clone all repos and use worktree for *all*
branches, and we only branch on the repos that we choose, for each
"working dir".

Our scripts probably would need certain modifications... but it should be
fine.

But I'm not, by far, the most problematic user.

The real problem, and why people accepted sub-modules, is that a lot
of downstream people only use one or another projects. Mostly LLVM or
Clang or libc++.

Checking out all of it is bad,

Define bad?
Time?
Disk space?
Bandwidth?

I mean, we already assume you have a lot of each anyway?

but having them officially interlinked,

it seems, is worse.

Why?
Below it sounds like you want to do this as a way of enforcing projects to
stay independent of each other.

I would posit that this is not the best way to do this?

We were originally trying to avoid too many moves at the same time.

There are already some CMake efforts to help build the different
repositories, but it's not linked to any proposal.

I think doing so would complicate both build system and version
control migrations...

--renato

I would actually like to see an example of how you would check out a common subset with the sparse checkout feature. jlebar, could you give us demo commands for this?

$ git clone --depth 1 https://github.com/llvm-project/llvm-project.git
$ cd llvm-project
$ ls
clang clang-tools-extra compiler-rt dragonegg klee ...
$ git config core.sparsecheckout true
$ echo "/llvm
/clang" > .git/info/sparse-checkout
$ git read-tree -mu HEAD
$ ls
clang llvm

I suppose you could even wrap this in a script and ship that with
llvm, if you wanted.
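
A minimal sketch of what such a wrapper might look like, reusing the
commands above. The function name `sparse_checkout_dirs` is hypothetical
(nothing like it ships with llvm), and it assumes an existing clone whose
top-level directories are the subprojects:

```shell
#!/bin/sh
# Hypothetical helper (not part of llvm): restrict the working copy of an
# existing clone to the listed top-level directories.
# Usage: sparse_checkout_dirs <repo-dir> <dir> [<dir> ...]
sparse_checkout_dirs() {
    repo=$1
    shift
    git -C "$repo" config core.sparsecheckout true || return 1
    # The info directory normally exists already; create it just in case.
    mkdir -p "$repo/.git/info" || return 1
    # One pattern per line, anchored at the repository root.
    for dir in "$@"; do
        printf '/%s\n' "$dir"
    done > "$repo/.git/info/sparse-checkout" || return 1
    # Re-read HEAD so the working copy matches the new patterns.
    git -C "$repo" read-tree -mu HEAD
}
```

For example, `sparse_checkout_dirs llvm-project llvm clang` would leave only
llvm and clang in the working copy. Writing `/*` to the pattern file and
re-running `git read-tree -mu HEAD` brings everything back.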

Checking out all of it is bad,

Define bad?
Time?
Disk space?
Bandwidth?

I mean, we already assume you have a lot of each anyway?

This is not about me, it's about people that use LLVM projects elsewhere.

but having them officially interlinked, it seems, is worse.

Why?
Below it sounds like you want to do this as a way of enforcing projects to
stay independent of each other.

Why does everyone take my comments as my own personal motives?

I'm just the "consensus seeker". None of these ideas are mine, I'm
just echoing what was said in 320 emails, plus what was said in the
past few years when people discussed using pure Git.

People on IRC were saying I had ulterior motives, that I was
pushing people to use GitHub or sub-modules, or whatever. This is
*really* not cool.

Every single thread so far has died down and I wrote a summary, and no
one said anything. Then I created another thread, and wrote another
summary. Once no one was disagreeing, I wrote the text.

Now everyone wants to disagree again. Seriously?

I *personally* don't care if we use GitHub, or GitLab, Git or
mercurial. I don't care if we have sub-modules or a monolithic
repository, but I'm not the only user.

LLVM has, so far, taken the modular approach, so that other projects can
embed our projects in their products. Downstream commercial products
do that, other OSS projects do that, and that's pretty cool.

GCC has had a *huge* flying monster in the last decade because they
weren't modular enough and that has been the big difference of LLVM,
and why it gained traction on impossible partners, like Emacs.

If we're saying we want to close everything down and make a compiler
like GCC, that will make my life **MUCH** easier. So there is
absolutely *no* point in me pushing the other way.

But I'm not the only user... And I'd rather not be selfish.

If the consensus has changed from last week, or if no one has actually
read the emails and threads and wants to do it all over again, please
be my guest.

cheers,
--renato

Justin Lebar <jlebar@google.com> writes:

Running the same 'git checkout' commands on multiple repos has
always been sufficient to manage the multiple repos so far

Huh. It definitely hasn't worked well for me.

Here's the issue I face every day. I may be working on (unrelated)
changes to clang and llvm. I update my llvm tree (say I checked in a
patch, or I want to pull in changes someone else has checked in). Now
I want to go back to hacking on my clang stuff. Because my clang
branch is not connected to a specific LLVM revision, it no longer
compiles. I'm trying to build an old clang against a new llvm.

Now I have to pull the latest clang and rebase my patches. After I
deal with rebase conflicts (not what I wanted to do at the moment!),
I'm in a new state, which means when I build my ccache is no help.
And when I run the clang tests, I don't know whether to expect test
failures. So then I have to pop off my patches and run at head...
(Maybe I have to update clang! In which case I also have to update
llvm...)

This would all be solved with zero work on my part if llvm and clang
were in one repository. Then when I switched to working on my clang
patches, I would automatically check out a version of LLVM that is
compatible.

I think this is the main thing that people aren't getting. Maybe
because it's never been possible before to have a workflow like this.
But having a git branch that you can check out and immediately build
-- without any rebasing, re-syncing, or other messing around -- is
incredibly powerful.

I don't know man, when I create a branch to save my clang work I just
create a branch with the same name in all the other repos I have checked
out, then it just stays in the state I left it in as I go do other
stuff. This kind of problem just hasn't really come up for me.

Please let me know if this is still not clear -- it's kind of the key point.

As I said, you can accomplish this with submodules, too, but it
requires the complex hackery from my original email.

To me, this is not at all a minor inconvenience. It's at least an
hour of wasted time every week.

I haven't tried the options jlebar has described to deal with these
- sparse checkouts and whatnot, but they seem like an equivalent
amount of work/learning curve as writing a script that cd's to
several directories and runs the same git command in each.

I'll send sparse checkout instructions separately. But my example
submodules commands are not at all equivalent to a script that cd's
into several directories and runs a git command in each, and I think
this is the main point of confusion. (In fact you wouldn't need to
write such a script; it's just "git submodule foreach".)
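
For comparison, a sketch of the kind of per-directory script being
discussed, assuming separate (non-submodule) checkouts sitting side by
side. The name `checkout_everywhere` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: switch every listed checkout to the same branch,
# creating the branch where it does not exist yet.
# Usage: checkout_everywhere <branch> <repo-dir> [<repo-dir> ...]
checkout_everywhere() {
    branch=$1
    shift
    for repo in "$@"; do
        if git -C "$repo" show-ref --verify --quiet "refs/heads/$branch"; then
            # Branch already exists in this checkout: just switch to it.
            git -C "$repo" checkout -q "$branch" || return 1
        else
            # First use in this checkout: create it from the current HEAD.
            git -C "$repo" checkout -q -b "$branch" || return 1
        fi
    done
}
```

With submodules, `git submodule foreach 'git checkout my-branch'` covers
the same ground. Either way, each repository's branch is tracked
separately; nothing records which llvm commit goes with which clang commit.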

The submodules commands create a single branch in the umbrella repo
that encompasses the checked-out state of *all the LLVM subrepos*. So
you can, at a later time, check out this branch in the umbrella repo
and all the clang, llvm, etc. bits will be identical to the last time
you were on the branch.

If all you want is to continue using git the way you use it now, the
multiple git repos get you that (as does a sparse checkout on the
single repo). My point is that the move to git opens up a new, much
more powerful workflow with branches that encompass both llvm and
clang state. We can do this with or without submodules, but using
submodules for this is far more awkward than using a single repo.

If I do `git log` in a sparse checkout that just has LLVM, will it only
show me LLVM commits? That is, how easy is it to filter out the
clang/lldb/subproject-X commits from a log? Negative globs are kind of
awkward.
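
One way to get that view, sketched under the assumption that each
subproject lives in its own top-level directory of the single repo, is to
limit the log to a path. Sparse checkout only affects the working copy, so
a plain `git log` still lists every commit; the path limiter is what
filters. The helper name `log_subproject` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list only the commits that touch one subproject
# directory of a single-repo checkout.
# Usage: log_subproject <repo-dir> <subproject-dir> [<git log options> ...]
log_subproject() {
    repo=$1
    subdir=$2
    shift 2
    # The "--" separates options and revisions from paths, so this also
    # works when the directory is absent from a sparse working copy.
    git -C "$repo" log "$@" -- "$subdir"
}
```

For example, `log_subproject llvm-project llvm --oneline` would print only
the commits that touch llvm/. The inverse view (everything except a few
directories) can be written with exclude pathspecs such as
`':(exclude)clang'`, which is the more awkward negative form.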