GitHub Hooks

So, there's been a bit of a misunderstanding about the hooks that are
supported in GitHub, and after talking to the GitHub staff, I'd like
to clarify what they are and how we can use them.

1. Pre-commit hooks, avoiding forced pushes / re-ordering

GitHub doesn't support server hooks due to security concerns.

But there is an alternative:

https://help.github.com/articles/about-required-status-checks/

I don't know how we'd check for non-ff-merges with this, and I'd
appreciate if someone with better GitHub knowledge could chime in. But
they *do* stop pushes from going in, which is what we want. Maybe we
would need a web-service (see 2) to get this working.

How does Swift solve this? Do we really need a linear history on the
projects, or just on the umbrella project?

2. Post-commit umbrella updates

We can use webhooks:

https://developer.github.com/webhooks/

This would hit some webpage / buildbot and make them update the
llvm-projects (with sub-modules) via git.

We'd be required to maintain a piece of web service somewhere, but the
maintenance of that web-service will be a lot less than the current
SVN/Git servers.
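For concreteness, here is a minimal sketch of what that web-service
could look like (Python; the port, paths and shared secret below are
made up, error handling and no-op pushes are omitted, and GitHub signs
webhook payloads with an HMAC-SHA1 X-Hub-Signature header that we can
verify):

import hashlib
import hmac
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

SECRET = b"shared-secret-configured-on-github"  # hypothetical
UMBRELLA = "/srv/llvm-umbrella"                 # hypothetical local clone

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        # Reject payloads that weren't signed with our webhook secret.
        expected = "sha1=" + hmac.new(SECRET, body, hashlib.sha1).hexdigest()
        if not hmac.compare_digest(expected,
                                   self.headers.get("X-Hub-Signature", "")):
            self.send_response(403)
            self.end_headers()
            return
        # On each push event, roll the umbrella's submodules forward
        # and commit/push the new submodule hashes.
        for cmd in (["git", "submodule", "update", "--remote"],
                    ["git", "commit", "-am", "Update submodules"],
                    ["git", "push"]):
            subprocess.check_call(cmd, cwd=UMBRELLA)
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), Handler).serve_forever()

Pointing a "push" webhook on each sub-project at something like this
would keep the umbrella's submodule hashes rolling forward.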

3. Post-commit email, for review/history

We can use email hooks:

https://help.github.com/articles/managing-notifications-for-pushes-to-a-repository/

This would be enabled on all projects except the umbrella, so we can
continue doing post-commit review.

I believe 2 and 3 should be reasonably easy to set up, but I'm not
sure about 1. It looks like it could work, but this is really a GitHub
thing more than a Git thing.

Any ideas?

cheers,
-renato

So, there's been a bit of a misunderstanding about the hooks that are
supported in GitHub, and after talking to the GitHub staff, I'd like
to clarify what they are and how we can use them.

1. Pre-commit hooks, avoiding forced pushes / re-ordering

GitHub doesn't support server hooks due to security concerns.

But there is an alternative:

About protected branches - GitHub Docs

Does this require submitting a PR for each commit?

I don't know how we'd check for non-ff-merges with this, and I'd
appreciate if someone with better GitHub knowledge could chime in. But
they *do* stop pushes from going in, which is what we want. Maybe we
would need a web-service (see 2) to get this working.

How does Swift solve this?

It doesn't; we use a pull request model.

It's encouraged to run the CI on a PR before merging but not mandatory.

Committing to the repo directly (without going through a PR) is possible.

Do we really need a linear history on the
projects, or just on the umbrella project?

I, for one, am in favor of maintaining a linear history in all the sub-projects.

2. Post-commit umbrella updates

We can use webhooks:

About webhooks - GitHub Docs

This would hit some webpage / buildbot and make them update the
llvm-projects (with sub-modules) via git.

We'd be required to maintain a piece of web service somewhere, but the
maintenance of that web-service will be a lot less than the current
SVN/Git servers.

So the web-service will maintain a canonical set of repos with linear history?

If so, +1.

best,
vedant

So, there's been a bit of a misunderstanding about the hooks that are
supported in GitHub, and after talking to the GitHub staff, I'd like
to clarify what they are and how we can use them.

1. Pre-commit hooks, avoiding forced pushes / re-ordering

GitHub doesn't support server hooks due to security concerns.

But there is an alternative:

About protected branches - GitHub Docs

I don't know how we'd check for non-ff-merges with this, and I'd
appreciate if someone with better GitHub knowledge could chime in. But
they *do* stop pushes from going in, which is what we want. Maybe we
would need a web-service (see 2) to get this working.

How does Swift solve this? Do we really need a linear history on the
projects, or just on the umbrella project?

Just to repeat what I said earlier: I consider linear history an important feature!

About the status checks: this sounds like we would have to set up a buildbot/Jenkins instance that reacts when pull requests are created, with a simple job
that checks linear history and commit dates for us, and then marks the commit as good to push as the last step (or pushes it automatically).
(We could extend this later to more advanced pre-commit sanity checking, like checking whether things compile, etc.)
This will be work we have to do.

About protected branches - GitHub Docs

Does this require submitting a PR for each commit?

It shouldn't do. It looks like the workflow would be:

1. Make your changes locally.
2. Push to your own fork (alternatively your own branch on github).
3. Run a sanity-check script that then POSTs to the status API on this
branch to indicate success.
4. git push to master.

I'm guessing a bit with step 2 -- I assume it has to allow statuses
from other forks, otherwise there would be chaos on LLVM's main
repository.

What I'm still not entirely clear on is how the status is
authenticated; it looks like it has to be an entirely trust-based
system. It doesn't look like you can check that "this status was set
by a server at llvm.org".

From here and IRC, it seems like the current checks we might want are (sketched in code below):

1. Linear history.
2. Monotonic timestamps.
3. No timestamps from the future (checked against a reliable internet
clock, not locally).
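A rough sketch of a script doing these checks (Python; it assumes we
validate the commits about to be pushed against origin/master, and it
uses the local clock plus some slack as a stand-in for the reliable
internet clock mentioned in check 3):

import subprocess
import sys
import time

RANGE = "origin/master..HEAD"  # assumption: the commits about to be pushed

# %H = commit hash, %ct = committer date (unix time), %P = parent hashes.
log = subprocess.check_output(
    ["git", "log", "--reverse", "--format=%H %ct %P", RANGE],
    universal_newlines=True)

ok = True
prev_time = 0
for line in log.splitlines():
    fields = line.split()
    sha, ctime, parents = fields[0], int(fields[1]), fields[2:]
    if len(parents) > 1:           # 1. linear history: no merge commits
        print("merge commit %s breaks linear history" % sha)
        ok = False
    if ctime < prev_time:          # 2. monotonic timestamps
        print("commit %s is older than its predecessor" % sha)
        ok = False
    if ctime > time.time() + 300:  # 3. no timestamps from the future
        print("commit %s has a future timestamp" % sha)
        ok = False
    prev_time = ctime
sys.exit(0 if ok else 1)

The result would then be POSTed as a status, as in the P.S. below.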

Tim.

P.S. Notes on using statuses for experiments:

1. Create an OAuth token here: https://github.com/settings/tokens (the
scope should be repo:status).
2. Put a simple status JSON into a file (e.g. simple.txt):
{
  "state": "success",
  "target_url": "https://example.com/build/status",
  "description": "The build succeeded!",
  "context": "sanity-check"
}

3. Update a commit on GitHub. E.g. (last hash is a commit, substitute
your username/OAuth token for authentication):

curl -H "Content-Type: application/json" -u TNorthover:<OAuth token> \
  --data @simple.txt \
  https://api.github.com/repos/TNorthover/tmp/statuses/d0679d64a1f8b70bb33d27123ff4ed23dad18012

4. Now that at least one status exists, you can enable a protected
branch on your project. I have https://github.com/TNorthover/tmp for
now (with master protected). I'd be happy to add anyone as a developer
if they want to play too (could be good for testing the fork
behaviour).

5. GitHub will deny a push to master unless it has a visible commit
with the successful status.

About the status checks: this sounds like we would have to set up a buildbot/Jenkins instance that reacts when pull requests are created, with a simple job
that checks linear history and commit dates for us, and then marks the commit as good to push as the last step (or pushes it automatically).

IIUC, we don't need that. If the hook returns "success" (as in, it
builds, or the history is linear), the push is allowed. If not, it's blocked.

(We could extend this later to more advanced pre-commit sanity checking, like checking whether things compile, etc.)

Yes, and I think a lot of people would want that, but we need to set
up some infrastructure to do that (Jenkins, buildbots, etc.) and
that's not trivial.

We'll leave it for later.

cheers,
--renato

So, there's been a bit of a misunderstanding about the hooks that are
supported in GitHub, and after talking to the GitHub staff, I'd like
to clarify what they are and how we can use them.

1. Pre-commit hooks, avoiding forced pushes / re-ordering

GitHub doesn't support server hooks due to security concerns.

Well, maybe GitHub hosting is not the right choice for our requirements, then.

But there is an alternative:

About protected branches - GitHub Docs

I don't know how we'd check for non-ff-merges with this, and I'd
appreciate if someone with better GitHub knowledge could chime in. But
they *do* stop pushes from going in, which is what we want. Maybe we
would need a web-service (see 2) to get this working.

How does Swift solve this? Do we really need a linear history on the
projects, or just on the umbrella project?

2. Post-commit umbrella updates

We can use webhooks:

About webhooks - GitHub Docs

This would hit some webpage / buildbot and make them update the
llvm-projects (with sub-modules) via git.

We'd be required to maintain a piece of web service somewhere, but the
maintenance of that web-service will be a lot less than the current
SVN/Git servers.

Claiming that it "will be *a lot* less" burden than now is easy, but I don’t see any obvious facts to back this up.
What is the current maintenance requirement of SVN/Git? Can someone who knows provide some facts?

CC Anton just in case.

2. Push to your own fork (alternatively your own branch on github).

We don't want branches on the official repo. Forks would be perfectly
fine, though, and very GitHub-y.

I'd be happy to add anyone as a developer
if they want to play too (could be good for testing the fork
behaviour).

Can you add me, please? user = rengolin.

cheers,
--renato

I'll let Anton tell his side, and Tanya talk about the real costs, but
here are some facts I know:

Our ARM/AArch64 buildbots fail around 2~3 times a month with SVN
errors. Sometimes it's only the fast ones, sometimes all of them
(depends on how long it takes to fix). Sometimes the fix is just
"wait", sometimes Anton has to actively fix it (he also has to work,
sleep, eat, etc.).

In the past, we were hit by web spiders that completely ignored the
robots.txt file. Anton has made that better, but it can escalate if
the spiders realise we blocked them. There are ways to work around it,
but not without accidentally blocking innocent people (mostly in China).

The cost of the AWS servers is ~$5k / year. It's not *only* for SVN,
but also for web servers and hosting packages. Recently we turned off
the deb hosting because of budget (our server and bandwidth couldn't
cope with it).

So, while $5k/year might not look like much, it's enough to send a lot
of students to LLVM events who couldn't otherwise go. It's also
nowhere near what we would need if we were to host a robust repository
with the features that GitHub can provide: mainly bandwidth, storage,
stability and support.

Given the AWS costs that I've seen at Linaro, we'd have to *at least*
double that money to host a dedicated machine with enough bandwidth
for repositories, binaries, videos, etc., not counting paying someone
to actively maintain it, if we want to compare one to one with what
GitHub provides for free.

I will make no attempt at estimating Anton's time, or Tanya's, or
anyone else's, but I believe they (and their companies/universities)
would very much rather work on actual compiler stuff. I'm sure that,
if we add up the human cost, it'll far outweigh the infrastructure
costs, even if we double/triple our current spending.

On the other hand, as Tim has shown, a web-service serving a JSON file
is lighter and cheaper to deliver than a normal web page (less
content, less bandwidth, less storage, less I/O), and could serve
hundreds, if not thousands, of queries per second on a small AWS
instance.

The web-hooks would be set up once and hosted by GitHub, so zero
additional work on our side, as would all the forking, branching,
merging, and the SVN interface (which we can't easily get if we move
to local Git).

The level of failure in the web-services will be lower (lower load,
less probability of barfing), and even if one does fail, it will only
affect the services that use it (buildbots, LNT, bisect), not any
other developer.

Moreover, our side of the web-service can fail catastrophically and
need a wipe and restart, and *none* of our commit history would be
affected. On the other hand, if the SVN fails catastrophically today,
I don't know if we have a good backup policy, which could mean commits
get lost. GitHub may not provide guarantees, but they do have proper
backup policies.

All in all, it may not look like much, but running a decent and stable
web service with so much at stake is *not* a simple task, and we
shouldn't take it for granted.

cheers,
--renato

Claiming that it "will be *a lot* less" burden than now is easy, but I don’t see any obvious facts to back this up.
What is the current maintenance requirement of SVN/Git? Can someone who knows provide some facts?

I'll let Anton tell his side, and Tanya talk about the real costs, but
here are some facts I know:

Our ARM/AArch64 buildbots fail around 2~3 times a month with SVN
errors. Sometimes it's only the fast ones, sometimes all of them
(depends on how long it takes to fix).

That’s relevant data.

Sometimes the fix is just
"wait", sometimes Anton has to actively fix it (he also has to work,
sleep, eat, etc.).

In the past, we were hit by web spiders that completely ignored the
robots.txt file. Anton has made that better, but it can escalate if
the spiders realise we blocked them. There are ways to work around it,
but not without accidentally blocking innocent people (mostly in China).

That’s not relevant: this is about the WWW server; it does not have to be related to hosting the repos.

The cost of the AWS servers is ~$5k / year. It's not *only* for SVN,
but also for web servers and hosting packages. Recently we turned off
the deb hosting because of budget (our server and bandwidth couldn't
cope with it).

Same.

So, while $5k/year might not look like much, it's enough to send a lot
of students to LLVM events who couldn't otherwise go.

Moving the SVN repo does not solve hosting videos, Debian packages, etc.
I suspect most of the bandwidth does not come from `svn up` or `git pull`.

It's also
nowhere near what we would need if we were to host a robust repository
with the features that GitHub can provide.

Like… proper hooks?

Mainly
bandwidth, storage, stability and support.

Given the AWS costs that I've seen at Linaro, we'd have to *at least*
double that money to host a dedicated machine with enough bandwidth
for repositories, binaries, videos, etc., not counting paying someone
to actively maintain it, if we want to compare one to one with what
GitHub provides for free.

You’re again conflating svn/git and hosting “binaries and videos”. I don’t think we ever planned to host these on github?

I will make no attempt at estimating Anton's time, or Tanya's, or
anyone else's, but I believe they (and their companies/universities)
would very much rather work on actual compiler stuff. I'm sure that,
if we add up the human cost, it'll far outweigh the infrastructure
costs, even if we double/triple our current spending.

Possibly, I don’t know, but that’s exactly why I asked for first-hand data on the subject (i.e. from Anton and/or Tanya) about hosting the git/SVN repos themselves, instead of hand-wavy “I believe” discussions.

On the other hand, as Tim has shown, a web-service serving a JSON file
is lighter and cheaper to deliver than a normal web page (less
content, less bandwidth, less storage, less I/O), and could serve
hundreds, if not thousands, of queries per second on a small AWS
instance.

The web-hooks would be set up once and hosted by GitHub, so zero
additional work on our side, as would all the forking, branching,
merging, and the SVN interface (which we can't easily get if we move
to local Git).

The level of failure in the web-services will be lower (lower load,
less probability of barfing), and even if one does fail, it will only
affect the services that use it (buildbots, LNT, bisect), not any
other developer.

Moreover, our side of the web-service can fail catastrophically and
need a wipe and restart, and *none* of our commit history would be
affected. On the other hand, if the SVN fails catastrophically today,
I don't know if we have a good backup policy, which could mean commits
get lost. GitHub may not provide guarantees, but they do have proper
backup policies.

All in all, it may not look like much, but running a decent and stable
web service with so much at stake is *not* a simple task, and we
shouldn't take it for granted.

Sure, "running a decent and stable web service is not a simple task”, that’s what I’m saying.

In the past, we were hit by web spiders that completely ignored the
robots.txt file. Anton has made that better, but it can escalate if
the spiders realise we blocked them. There are ways to work around it,
but not without accidentally blocking innocent people (mostly in China).

That’s not relevant: this is about the WWW server; it does not have to be related to hosting the repos.

No, this is about hosting the SVN server. The SVN view was disabled
for months this year before we could really see what was going on.

Moving the SVN repo does not solve hosting videos, Debian packages, etc.
I suspect most of the bandwidth does not come from `svn up` or `git pull`.

They share the same bandwidth, and sometimes the same server. It is relevant.

One thing making SVN slow was the amount of Debian packages being
downloaded from the same place.

Like… proper hooks?

If we can work around it, and it seems we can, this is not such a big issue.

You’re again conflating svn/git and hosting “binaries and videos”. I don’t think we ever planned to host these on github?

No, but they all share bandwidth. We moved videos to YouTube to
offload the bandwidth, and moving the code to GitHub shares the same
mindset.

Possibly, I don’t know, but that’s exactly why I asked for first-hand data on the subject (i.e. from Anton and/or Tanya) about hosting the git/SVN repos themselves, instead of hand-wavy “I believe” discussions.

Bear in mind that I gave you facts (bandwidth problems, turned off SVN
services, constant breakdowns, expertise in handling traffic, backup
solutions).

I also made you aware that the human cost is not *just* Tanya and
Anton, but also me and everyone else that maintains buildbots,
external mirrors, etc. and it *is* larger than the hardware costs. You
just don't see it because we're all volunteers.

Branding them as "hand-wavy I believe" is *not* appropriate.

cheers,
--renato

In the past, we were hit by web spiders that completely ignored the
robots.txt file. Anton has made that better, but it can escalate if
the spiders realise we blocked them. There are ways to work around it,
but not without accidentally blocking innocent people (mostly in China).

That’s not relevant: this is about the WWW server; it does not have to be related to hosting the repos.

No, this is about hosting the SVN server. The SVN view was disabled
for months this year before we could really see what was going on.

I don’t believe the online SVN viewer has to be on the server that hosts the repo that everyone accesses: the WWW server could mirror the SVN to provide local access to the viewer if needed (which is why I view this as unrelated to hosting source code).

Moving the SVN repo does not solve hosting videos, Debian packages, etc.
I suspect most of the bandwidth does not come from `svn up` or `git pull`.

They share the same bandwidth, and sometimes the same server. It is relevant.

Well, “they share the same bandwidth” is exactly what I mean by “conflating the issues”.
They don’t *have to* share the same bandwidth: hosting repos could be set up totally separately from hosting the WWW.
You need to account for things properly.

One thing making SVN slow was the amount of Debian packages being
downloaded from the same place.

Like… proper hooks?

If we can work around it, and it seems we can, this is not such a big issue.

You’re again conflating svn/git and hosting “binaries and videos”. I don’t think we ever planned to host these on github?

No, but they all share bandwidth. We moved videos to YouTube to
offload the bandwidth, and moving the code to GitHub shares the same
mindset.

It shares the same mindset *only* if the code itself is a significant bandwidth consumer; otherwise, it does not make sense.

Possibly, I don’t know, but that’s exactly why I asked for first-hand data on the subject (i.e. from Anton and/or Tanya) about hosting the git/SVN repos themselves, instead of hand-wavy “I believe” discussions.

Bear in mind that I gave you facts (bandwidth problems, turned off SVN
services, constant breakdowns, expertise in handling traffic, backup
solutions).

And I consider many of the “facts” you gave to conflate elements other than hosting the repository *alone*, which makes it hard for me to see them as relevant as-is.

I also made you aware that the human cost is not *just* Tanya and
Anton, but also me and everyone else that maintains buildbots,
external mirrors, etc. and it *is* larger than the hardware costs. You
just don't see it because we're all volunteers.

Branding them as "hand-wavy I believe" is *not* appropriate.

I apologize if I hurt your feelings, but the reality is that I feel you’re conflating multiple things that are not directly related to “moving the repository only”, and that does not help to be convincing. My use of “hand-wavy”, if that’s what bothered you, really means just that: as a non-native speaker, I’m not attaching any other value judgement to this expression, and maybe it was not the right choice of word.

Perhaps it helps to know that I have access to the machines and have helped debug many of the current problems. I’m not speaking from the outside, guessing how hard things are.

I also think you are assuming a lot about where services can be hosted and at which cost (labour, not hardware).

So, unless you are volunteering to take care of the whole infrastructure, I suggest taking the opinion of the people that are working with it a bit more seriously.

However, I’d still trust GitHub over any one of us, any day, to host our repositories. Any one of us could leave the community at any time, but GitHub, as a company with many employees and a successful business, will probably outlast any of us in our current employments.

From that point of view, the Foundation would be better off betting its money on a stable product than on individual volunteers. While it’s cool that some people volunteer, you can’t base such a critical system on that.

Cheers,
Renato

Perhaps it helps to know that I have access to the machines and have helped debug many of the current problems. I’m not speaking from the outside, guessing how hard things are.

I also think you are assuming a lot about where services can be hosted and at which cost (labour, not hardware).

It seems you’re assuming as much about me as you think I’m assuming about this whole thing…

So, unless you are volunteering to take care of the whole infrastructure, I suggest taking the opinion of the people that are working with it a bit more seriously.

Sorry, I appreciate your opinion, but it is still not “facts” or “data”. When you write “it will save a lot of maintenance”, I raise my eyebrows. I’ll take what you write seriously when it is phrased less qualitatively and more quantitatively; for instance, I would ask fewer questions if you wrote “currently, volunteers (Anton, Tanya, whoever) spend on average 5h a week fixing problems purely related to the repositories (not the videos, not the Debian packages, etc.), and moving to an external host would save all of these 5h/week”.

I’ve not read all of the GitHub threads, so sorry if this has been brought up, but…

You’re not taking into account that we’re all volunteers and could disappear overnight.

You’re taking the stance that this is a company with employees and continuity plans, when in reality our hours worked are less important than the time it takes for us to fix the problems, which is measured in days, weeks or months, not hours or minutes. Everyone else’s time will also be affected (and was) in addition to ours.

You’re also assuming that we’re being compensated, or at least doing this during our work hours. I don’t know about Anton, but I’m mostly doing this in my spare time (it’s almost 1am here, and I’m in bed already).

There are things that you can’t put a price on, and comparing volunteered personal time with wages and hardware cost is, indeed, a bit offensive.

Cheers,
Renato

That is what I’m proposing, and Tim is helping us test. We should reach a solution quickly, and once we do, I’ll update the document.

Feel free to try his repo, I’ll only try tomorrow. If you guys come up with a clear flow before that, let me know.

Cheers,
Renato

I have already tested protected branches on GitHub successfully and found that they allowed exactly the pushes that were correct – they must all have the current HEAD as an ancestor, and so they always move the repo forward without dropping already-pushed patches.

At most, it might make sense to have some client-side scripts that we encourage users to install, which check for accidental pushes of massive series of patches in a single go and warn about them.
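For illustration, such a script could be a pre-push hook along these
lines (a sketch in Python; the threshold of 10 commits is an arbitrary
assumption, not a tested policy):

#!/usr/bin/env python
# Hypothetical .git/hooks/pre-push: git feeds it one line per ref
# being pushed: "<local ref> <local sha> <remote ref> <remote sha>".
import subprocess
import sys

THRESHOLD = 10   # arbitrary; warn above this many commits per push
ZERO = "0" * 40  # an all-zero sha means a ref is created or deleted

for line in sys.stdin:
    local_ref, local_sha, remote_ref, remote_sha = line.split()
    if ZERO in (local_sha, remote_sha):
        continue  # ref creation/deletion; nothing sensible to count
    count = int(subprocess.check_output(
        ["git", "rev-list", "--count", "%s..%s" % (remote_sha, local_sha)]))
    if count > THRESHOLD:
        sys.stderr.write("pre-push: %d commits in one push to %s; aborting "
                         "(git push --no-verify to override)\n"
                         % (count, remote_ref))
        sys.exit(1)
sys.exit(0)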

Right! Can you add a comment to the document review?

The other two hooks were good (email, update umbrella), so I think we’re set.

Cheers,
Renato

I can try to find time; I honestly haven’t had the spare time recently to track or contribute significantly to the GitHub thing. I just wanted to try and avoid the search for a complex solution if the simple one is available and sufficient.

Protected branches alone don’t enforce a linear history without the “status checks” feature. I don’t believe Chandler is proposing to use “status checks”, and he is concerned with “rewriting the history” more than with enforcing a linear history, so you’ll have to be careful about what exactly is being promised.