Git Move: GitHub+modules proposal

So,

It's been a while and the GitHub thread is officially dead, so I'll
propose a development methodology based on the feedback from that
thread. This is not *my* view, but all that was discussed in the
threads.

My objective is to form an official proposal to use Git as our main
repository, overcoming all the problems we currently have without
creating many others. In the end, I think Tanya wanted to hold a vote
and see how strongly the community feels about it. The vote should be
"Should we move to GitHub with the proposed workflow, or should we try
to find another solution away from our own hosting?".

The important part is that we *must* move from the current schema. It
does not scale and the administrative costs are not worth the trouble.
So, if we don't go with GitHub, we have to find another way out.

The proposal

1. Control the history via server hooks updating a unique and
auto-increment identifier, which will apply to every commit on its
submodules (ie, every other LLVM project).

Does GitHub allow this? IIRC their support for server-side hooks was
very limited due to obvious reasons. And executing hooks e.g. on
llvm.org seems very error-prone.

Someone suggested it was possible. I have sent them an email with a
draft proposal and they said everything was fine, though they didn't
confirm specific support.

I can't see why changing a local auto-increment ID on the repository
itself would be a breach of security, so even if there are
limitations, I think we can get this done.

I'll send them another email to confirm this specific point.

cheers,
--renato

I really liked the solution proposed earlier in this thread: Do nothing server side, but instead use
`git rev-list --count master` on the client side (which takes 0.9s on my machine) to get the number of the commit. So nothing to do on the ID part IMO.

As for updating the meta repository: We could disable write access for the normal llvm developer and delegate the submodule bumping to an external
server. I believe this would be an easy enough job for buildbot or jenkins.

- Matthias
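
As a rough illustration of the client-side numbering suggested above, here is a minimal Python sketch that wraps `git rev-list --count`. The helper names (`commit_count`, `revision_label`) and the "rN" label format are assumptions made for illustration only, not part of any agreed workflow, and the count is only a stable sequential number if the history of the branch stays linear.

```python
import subprocess


def commit_count(repo_dir, ref="master", run=subprocess.check_output):
    """Return the number of commits reachable from `ref` in `repo_dir`.

    `run` is injectable so the function can be exercised without a real
    repository; by default it shells out to git.
    """
    out = run(["git", "rev-list", "--count", ref], cwd=repo_dir, text=True)
    return int(out.strip())


def revision_label(count):
    """Format a count as an SVN-style label (hypothetical convention)."""
    return f"r{count}"


# Usage, from inside a checkout:
#   print(revision_label(commit_count(".")))
```

Nothing here needs server support, which is the point of the suggestion; it does not address numbering across several repositories.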

I really liked the solution proposed earlier in this thread: Do nothing server side, but instead use
`git rev-list --count master` on the client side (which takes 0.9s on my machine) to get the number of the commit. So nothing to do on the ID part IMO.

Mehdi replied to this proposal:

"it does not help to solve the cross-repository problem, we need a
"meta integration repository"."

As for updating the meta repository: We could disable write access for the normal llvm developer and delegate the submodule bumping to an external
server. I believe this would be an easy enough job for buildbot or jenkins.

The plan is to disable all write access to this repository (otherwise
we'll create a nightmare). Having an external counter could be
problematic due to synchronisation issues.

If the hook doesn't work, we'll have serious problems.

cheers,
--renato

So, I probably missed something, but what was the main objection to
just using submodules? This would put llvm inside clang instead of the
other way around. When changing an API one currently has to

* Change llvm.
* Quickly change clang and hope no bot picks up a revision with only
the llvm change.

With submodules it becomes

* Change llvm.
* Change clang and in the same atomic commit change what revision the
submodule points to.

Cheers,
Rafael

Two problems:

  1. Submodules have some UX problems for developers around updating the parent project and its effects on the submodule which make them annoying to use.
  2. I find the advantage you claim especially scary and bad. Put another way: if a developer doesn’t make a commit to clang with the new submodule pointer, clang will not actually start using the new version of LLVM until someone gets around to updating the pointer. Meaning, the next time anyone ELSE goes to update the submodule pointer in clang, they would have to, effectively, integrate all of the as-yet-untested-with-clang changes from llvm, and fix any problems that might cause. I really don’t think we want that.

What I do think would be nice is a way to associate commits to llvm and commits to clang as one unit that updates a parent repository. I’ve mentioned before that gerrit seems to have that functionality.

I think it’d be a great idea to do some testing, and make a second proposal centered around using Gerrit to manage commits to the github repository, versus committing to github directly. I’m not sure if I’ll have time to do that properly, though.

So, I probably missed something, but what was the main objection to
just using submodules? This would put llvm inside clang instead of the
other way around. When changing an API one currently has to

I don't think the consensus was to change the order of inclusion (llvm
into clang), but to *not* change anything else at this stage.

That's one of the reasons the umbrella project with sub-modules was
the most accepted solution, because we can later change the inclusion
order (if we all agree, etc), without changing the underlying storage.

* Change llvm.
* Change clang and in the same atomic commit change what revision the
submodule points to.

Having one repository inside another was rejected due to the problems
it brings for development, validation and release. James has just
pointed out a few of those problems for development.

An umbrella project with a commit hook and an auto-update would make
sure all commits are synchronised correctly. Though, indeed, this will
mean we'll still have the trouble of buildbots picking up one commit
and not the other, I don't think this is a big enough problem that we
should mess up everyone's workflow.

cheers,
--renato

I think that trying to create an ordering/rev number between independent git repositories is fundamentally unreliable.

If we want to keep llvm and clang in lock step we should probably just have them in the same repository like https://github.com/llvm-project/llvm-project.

Cheers,
Rafael

Hello there,

Renato, thank you for putting everything together.

Regarding the second question (commits mailing list): GitHub provides a set of various webhooks [1]. I think here we are
particularly interested in 'push' events.

Besides that, it has some CI-related integrations: buildbots can update pull request status to show if tests are passing or not. The builds can also be triggered using webhooks (issue_comment with specific text). IIRC swift and rust do this (and more) in a very similar way.

Cheers,
Alex.

[1] https://developer.github.com/webhooks
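
To make the 'push' webhook idea a bit more concrete, here is a minimal Python sketch of how a receiving server might turn a push payload into the text of a commits-list email. The function name `summarize_push` and the output layout are assumptions for illustration; the payload fields used (`ref`, `repository.full_name`, and `commits` entries with `id` and `message`) are the ones GitHub sends for push events, but this is a sketch, not a tested integration.

```python
import json


def summarize_push(body):
    """Build a short plain-text summary from a GitHub push payload.

    `body` is the raw JSON request body (bytes or str).
    """
    payload = json.loads(body)
    repo = payload["repository"]["full_name"]
    ref = payload["ref"]
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    commits = payload.get("commits", [])

    lines = [f"[{repo}] {branch}: {len(commits)} new commit(s)"]
    for commit in commits:
        message_lines = commit["message"].splitlines()
        subject = message_lines[0] if message_lines else ""
        # Abbreviate the hash so the summary stays readable.
        lines.append(f"  {commit['id'][:12]} {subject}")
    return "\n".join(lines)
```

A real receiver would also need to verify that the request really came from GitHub and then hand the summary to a mailer; both are left out here.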

That is similar to the proposal we have, except that llvm-project
will have sub-modules.

Having all of them in the same physical repository is a big problem
for those that only use a small subset of the components, which is the
vast majority of users, most of the time (all buildbots, Jenkins,
local development, downstream users, even releases don't clone all
repos).

cheers,
--renato

It also has submodules. https://github.com/llvm-project/llvm-project-submodule

Both llvm-project(-tree) and (-submodule) have refs/notes/commits.

Nice! Can you try a server hook that will add an auto-increment number
from the submodules' commits?

cheers,
--renato

Why? Assuming we don’t have branches, there were many mentions that the id can be computed from the number of commits in the history.

This is easy to compute when coming from SVN; the difficulty will be keeping this when multiple git repos are the source.

I really like this too, and think Takumi has basically solved 90% of
the problem for us already. We may want to add an "rN" line to avoid
scaring people with hex commits, but that seems to be all that's
lacking and not really essential anyway.

Tim.

Not so much scaring, but to avoid rushed migrations.

Current SVN-aware handling (downstream infra, bisects) deals better
with sequential numbers, and it may take some time to migrate to a
full Git solution.

We should move all upstream infrastructure, but we can't dictate
the downstream pace (or it would never happen).

cheers,
--renato

We have branches (release_nm) and we may want them to be in the same
sequential numbering.

So, I'm assuming this hook gets executed every time a new commit
arrives, but they're sub-modules, would they also notify the parents?

If this works in a trigger mode (every commit), not a timed basis
(every 5 minutes), then it could work well.

cheers,
--renato

I saw 90% of the problem as being “how to have the cross repo synchronization in git and on github”. Right now, coming from a single SVN repo, it seems comparatively trivial to me.

Hi all,

A short summary: Takumi has done 90% of the work here:

https://github.com/llvm-project/llvm-project-submodule

and I've been talking to GitHub, and here are the answers to my questions:

1. How will the umbrella project's auto-increment hook work?

Since the umbrella project cannot see the sub-modules' commits without
some form of update, there are two ways to do this:

P. Per push: Every push (not commit) on all other repositories will
trigger a hook that will hit a URL on our server, telling it to
generate an incremental ID, update some umbrella's SeqID property (or
even a commit SHA) and update the sub-modules.

T. Time based: A cron job in our server would frequently pull from all
repos and update ID/modules.

Option P is less confusing and more fine-grained, but if it misfires,
we'll lose that push, and its commits will be bundled with the next
push on that repo.
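
A minimal Python sketch of the per-push idea, under stated assumptions: the class name `UmbrellaSequencer`, the in-memory state and the tuple layout are all hypothetical, and a real server would have to persist the counter and actually update the umbrella repository. It only shows how one ID per push is handed out and how a missed notification ends up folded into the next push on the same repo.

```python
import threading


class UmbrellaSequencer:
    """Hand out one sequential ID per push notification (sketch only)."""

    def __init__(self, heads):
        self._lock = threading.Lock()  # serialise concurrent notifications
        self._next_id = 1
        self.heads = dict(heads)       # repo name -> last recorded commit
        self.revisions = []            # (seq_id, repo, old_head, new_head)

    def on_push(self, repo, new_head):
        """Record a push; return its SeqID, or None if nothing changed."""
        with self._lock:
            old_head = self.heads[repo]
            if old_head == new_head:
                return None
            seq_id = self._next_id
            self._next_id += 1
            self.heads[repo] = new_head
            self.revisions.append((seq_id, repo, old_head, new_head))
            return seq_id


seq = UmbrellaSequencer({"llvm": "a0", "clang": "c0"})
seq.on_push("llvm", "a1")   # SeqID 1 covers a0..a1
seq.on_push("clang", "c1")  # SeqID 2 covers c0..c1
# The notification for llvm "a2" is lost; the next push covers a1..a3.
seq.on_push("llvm", "a3")   # SeqID 3 bundles a2 and a3
```

The lock matters because pushes to different repositories can arrive at the same time, and two of them must never receive the same ID.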

Option T will invariably bundle things together, even from different
repositories. The chance that this will "correctly" merge an
LLVM+Clang double-patch is not worth the trouble for the noise.
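
The bundling behaviour of the time-based option can be sketched in a few lines of Python. The function `poll_once` and its arguments are hypothetical; a real cron job would fetch the heads from the repositories instead of receiving them as a dictionary.

```python
def poll_once(recorded, current, next_id):
    """Simulate one cron tick of the time-based option.

    `recorded` maps repo -> head stored in the umbrella (updated in place),
    `current` maps repo -> head found by this poll.
    Returns (seq_id, changed); seq_id is None when nothing moved.
    """
    changed = {
        repo: head
        for repo, head in current.items()
        if recorded.get(repo) != head
    }
    if not changed:
        return None, {}
    recorded.update(changed)
    # Everything that moved since the last tick shares a single ID,
    # whether or not the changes are related.
    return next_id, changed


recorded = {"llvm": "a0", "clang": "c0"}
first = poll_once(recorded, {"llvm": "a2", "clang": "c1"}, next_id=1)
second = poll_once(recorded, {"llvm": "a2", "clang": "c1"}, next_id=2)
```

In this run the first tick puts two llvm commits and one clang commit under the same ID, and the second tick produces no revision at all.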

For both of them, we need an external server, as there's no way to
update a repository's property from another.

Multiple commits eventually getting into a single umbrella revision
can be innocuous for development, but they can make controlling the
version for releases a bit more complicated. Though, it would also
have no effect on back-ports, since they'll be done on Git and get
their own SeqID.

All in all, I'm not too worried about this...

2. How do we update the commits mailing lists?

This is, apparently, trivial in GitHub:

https://help.github.com/articles/managing-notifications-for-pushes-to-a-repository/#enabling-email-service-notifications-for-pushes-to-your-repository

Any more comments before we put this proposal to vote?

Is anyone going to propose an alternative Git solution?

Or maybe an external, reliable and trustworthy SVN repository (ie.
*not* SourceForge)?

In the interests of brevity and peacefulness, we should aim to only
have one vote, even if it has multiple choices, so if we have more
proposals, please bring them up.

cheers,
--renato