RFC: Move the test-suite LLVM project to GitHub?

Subject kinda says it all. Here is my rationale:

The test-suite is really weird relative to the rest of the LLVM project:

  1. It contains all manner of crazily licensed code.

  2. We don’t really care about the history at all. Any concerns around linear history or bisection are pretty much irrelevant.

  3. We don’t ever plan to have LLVM code move into or out from the test-suite.

  4. It’s already big, and really should be much bigger. We shouldn’t have incentives to keep stuff out of the test suite because of size, hosting cost, or anything else.

For all of these reasons, and also because I’d like to see how well (or rather, how poorly) a service like GitHub actually works for the project, it seems like splitting the test-suite out of the current subversion repository and moving it there is the right call.

When I chatted with folks on the board, this made sense to them as well, and I’ve made sure we have a reasonable LLVM organization set up on GitHub and all the board members are on it: https://github.com/llvm (I think only my membership is public at the moment).

There is still plenty to figure out about how to manage this on github, but before doing anything else I just wanted to shoot an email and see if folks like this idea.

Thanks!
-Chandler

There is still plenty to figure out about how to manage this on github,

Thanks for pointing out this part. We certainly need a discussion on how to handle a git repo, but that can wait until others have chimed in on this post.

but before doing anything else I just wanted to shoot an email and see if folks like this idea.

Sounds good. +1 from me for all the various reasons you already stated.

Cheers,
Pete

There is still plenty to figure out about how to manage this on github,
but before doing anything else I just wanted to shoot an email and see if
folks like this idea.

My question's probably somewhere in this "plenty to figure out" but it'll
be moderately annoying to have multiple ways of managing LLVM subprojects
(yeah, I realize this is sort of an exceptional one - one I don't usually
have checked out anyway, so I don't care too much - but it would break my
cute little script that trawls subrepositories and syncs them all up in my
llvm repo/checkout). Also, I assume there's some amount of version lock
between the rest of the project and the test-suite (cleaning it up if we
make breaking changes, etc - the idea of having LLVM bitcode in there for
Halide would mean that we wouldn't want to run newer versions of the
test-suite on older versions of the compiler, etc).

So, yeah, just curious about the practical problems, no philosophical
objection I suppose.

- Dave
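
A minimal sketch of the kind of sync script described above, assuming a layout where subprojects are independent working copies nested inside one checkout, and assuming Subversion 1.7 or later (a single `.svn` directory per working-copy root). The function names are hypothetical; the script only plans the update commands and does not run them, so a Git-hosted test-suite and Subversion-hosted siblings can be handled side by side.

```python
import os

VCS_DIRS = (".git", ".svn")


def find_working_copies(root):
    """Yield (path, vcs) for root and every nested Git or Subversion working copy."""
    for dirpath, dirnames, filenames in os.walk(root):
        # A ".git" entry may be a directory (normal clone) or a file (worktree).
        if ".git" in dirnames or ".git" in filenames:
            yield dirpath, "git"
        elif ".svn" in dirnames:
            yield dirpath, "svn"
        # Do not descend into version-control metadata; sort for stable output.
        dirnames[:] = sorted(d for d in dirnames if d not in VCS_DIRS)


def sync_commands(root):
    """Return (path, command) pairs; each command is meant to run inside path."""
    update = {"git": ["git", "pull", "--rebase"], "svn": ["svn", "update"]}
    return [(path, update[vcs]) for path, vcs in find_working_copies(root)]
```

Each returned command would be run with the corresponding path as the working directory, for example via `subprocess.run(cmd, cwd=path, check=True)`.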

Subject kinda says it all. Here is my rationale:

The test-suite is really weird relative to the rest of the LLVM project:
1) It contains all manner of crazily licensed code.
2) We don't really care about the history at all. Any concerns around
linear history or bisection are pretty much irrelevant.
3) We don't ever plan to have LLVM code move into or out from the test-suite
4) Its already big, and really should be much bigger. We shouldn't have
incentives to keep stuff out of the test suite because of size, hosting
cost, or anything else.

There are two size limitations w.r.t. moving to github:

1) They have a prohibition on having individual files over 100M [1], unless the repository is set up with their Git LFS plugin. This plugin hasn't been around all that long, and is not something I would use in production. (Not to mention the fact that it's not supported on public forks anyway [2])

The largest individual files in the test-suite are around 10-12M, so this isn't a problem yet, but it could become one later... something to think about.

2) There is the cap on total repository size, which is in the neighborhood of 1 GB [3]. A fresh checkout of test-suite clocks in at just over 3 GB. This one actually is a problem.
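
A rough way to check a checkout against the two figures above, as a sketch only: the thresholds are the 100 MB and 1 GB numbers quoted in this thread (not authoritative limits), and `audit_checkout` is a hypothetical helper. It measures working files and skips `.git`/`.svn` metadata, so it is a proxy for, not a measurement of, the size of the resulting Git repository.

```python
import os

# Thresholds taken from the figures quoted in this thread; the hosting
# service's actual limits may differ and should be checked separately.
FILE_LIMIT_BYTES = 100 * 1024 ** 2   # per-file limit (about 100 MB)
REPO_LIMIT_BYTES = 1024 ** 3         # recommended repository size (about 1 GB)


def audit_checkout(root, file_limit=FILE_LIMIT_BYTES, repo_limit=REPO_LIMIT_BYTES):
    """Return (total_bytes, oversized_files, exceeds_repo_limit) for a tree."""
    total = 0
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata so only working files are counted.
        dirnames[:] = [d for d in dirnames if d not in (".git", ".svn")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            total += size
            if size > file_limit:
                oversized.append((os.path.relpath(path, root), size))
    # Largest offenders first.
    oversized.sort(key=lambda item: item[1], reverse=True)
    return total, oversized, total > repo_limit
```

Calling `audit_checkout("path/to/test-suite")` returns the total size in bytes, the list of files above the per-file threshold, and whether the total exceeds the repository threshold.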

There is still plenty to figure out about how to manage this on github,
but before doing anything else I just wanted to shoot an email and see
if folks like this idea.

+1, assuming it can be made to work given the other concerns above.

Jon

1: Managing large files - GitHub Docs
2: https://github.com/github/git-lfs/issues/773#issuecomment-150569337
3: About large files on GitHub - GitHub Docs

I don’t really care where the repository is located, but I do have some comments on the future test-suite directions:

Subject kinda says it all. Here is my rationale:

The test-suite is really weird relative to the rest of the LLVM project:

  1. It contains all manner of crazily licensed code.

That’s indeed a good reason to move the repository away.

  2. We don’t really care about the history at all. Any concerns around linear history or bisection are pretty much irrelevant.

We do care about the history. Sometimes benchmarks get fixed or tweaked, which may change the results; we should be able to dig into the history to see what happened when. In any case, retaining the history wouldn’t be a problem, would it?

  3. We don’t ever plan to have LLVM code move into or out from the test-suite

I could actually see moving LLVM code into the test-suite (we already use lit code from LLVM), but I don’t foresee moving code out of the test-suite into LLVM.

  4. It’s already big, and really should be much bigger. We shouldn’t have incentives to keep stuff out of the test suite because of size, hosting cost, or anything else.

I agree with the goal of having a big test-suite. However I think there is a point where we should rather strive to have a stable base system for building and running tests, etc. and then have the actual benchmarks/tests being modules on top of that. We already have that situation today with External/SPEC* and I think it would be a good idea to have a mode where you just checkout more benchmarks into a test-suite subdirectory and they are automatically recognized and used (in fact that is something on my TODO list though at a very low position).

- Matthias

Dear Chandler,

First, can you articulate why you want to move the test suite to Github? Is it taking up too much space, or is there some other problem that you're trying to solve? I think you clearly explain why moving the revision history isn't necessary, but it's not clear to me what problem you are trying to solve.

Second, if we move the revision history to Github, it would be nice to archive the existing Subversion history somewhere (e.g., leave it on llvm.org but disable commit access to it). The test suite has been used in numerous research papers, so keeping the revision history around is good practice. We should only delete the Subversion revision history if keeping it around is just too onerous.

Third, I assume your plan is to continue to track changes on Github. Is that correct?

As long as there's a good reason to do it and the existing Subversion history isn't deleted, I don't see a problem with the change.

Regards,

John Criswell

I don’t see that as a reason to move the repository. Where a repository lives and what format it uses is an orthogonal issue to what license the software within the repository is allowed to use. There are policies on what licenses can be used for Clang and LLVM code; the policy for the allowed licenses in test-suite is just different.

I think Chandler’s point (Chandler, please correct me if I’m wrong) is that it’s not important to a) match the test suite revision numbers to LLVM source code revision numbers and b) copy the SVN history to the GitHub repository.

John Criswell

Okay, if this is about increasing Subversion revision numbers. There is always the date/time to relate commits to each other. And I’d agree that matching a specific llvm/clang revision to a test-suite revision is not that useful. In fact I like the fact that you can mix and match different clang and test-suite revisions and not just have one giant checkout with clang/llvm + test-suite moving in sync.
As for history: The old Subversion revision numbers are also still part of the commit descriptions anyway, so it is still possible to reconstruct things.
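
As an illustration of reconstructing revisions from commit descriptions, here is a sketch that assumes the Git history was produced by git-svn, which appends a `git-svn-id: <url>@<revision> <uuid>` line to each commit message. Whether the eventual test-suite repository carries such lines is an assumption, and `svn_revision` is a hypothetical helper name.

```python
import re

# git-svn appends a trailer of the form:
#   git-svn-id: <url>@<revision> <repository-uuid>
GIT_SVN_ID = re.compile(r"^git-svn-id:[ \t]+(\S+)@(\d+)[ \t]+(\S+)[ \t]*$",
                        re.MULTILINE)


def svn_revision(commit_message):
    """Return (url, revision) from a git-svn trailer, or None if absent."""
    match = GIT_SVN_ID.search(commit_message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

The commit messages themselves could be obtained with something like `git log --format=%B`, one commit at a time.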

- Matthias

This is not really true. Individual pack files must be below 1GB, but
the total repository can be much larger. That's true even for the free
tiers. That said, it might be a good idea to split the repository into
modules to keep it manageable.

Joerg

There are two size limitations w.r.t. moving to github:

1) They have a prohibition on having individual files over 100M [1],
unless the repository is set up with their Git LFS plugin. This plugin
hasn't been around all that long, and is not something I would use in
production. (Not to mention the fact that it's not supported on public
forks anyway [2])

The largest individual files in the test-suite are around 10-12M, so
this isn't a problem yet, but it could become one later... something to
think about.

I'd expect the LFS plugin to be mature before this becomes an issue.

2) There is the cap on total repository size, which is in the
neighborhood of 1 GB [3]. A fresh checkout of test-suite clocks in at
just over 3 GB. This one actually is a problem.

Talk to the GitHub folks. They say they will make exceptions if you can explain your needs.
I wouldn't be surprised if GitHub went out of their way to get LLVM onto their boat.

Regards,
Jo.

2) There is the cap on total repository size, which is in the neighborhood
of 1 GB [3]. A fresh checkout of test-suite clocks in at just over 3 GB. This
one actually is a problem.

This is not really true.

From the horse's mouth:

"We recommend repositories be kept under 1GB each. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."

For all of these reasons, and also because I'd like to see how well (or
rather, how poorly) a service like GitHub actually works for the project,
it seems like splitting the test-suite out of the current subversion
repository and moving it there is the right call.

My experience from a few years of contributing and a bit of project setup:

#1) GitHub works really well for public discussions of code changes (pull requests) and issues.

#2) Labels are too coarse-grained to be very useful. Don't expect to do request prioritization etc. that way. (This may change in the future.)

#3) Contributors without commit rights need to set up a fork (really a "git clone") on GitHub, and commit to that before they can issue a pull request.
This sounds easy enough in theory, but for your local work you pull and merge from "origin" and push to "fork"; then you go to the GitHub site and start the pull request, and once it's in you start cleaning up work branches both locally and on GitHub. It's a lot of clerical work, and some aspects of all this are easy to get wrong for a git newbie.
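
To make the clerical steps concrete, here is a sketch that lists the git commands for one contribution through a personal fork. It only builds the plan and runs nothing; the remote name "fork" follows the description above, the URLs and branch name passed in are placeholders, and `fork_workflow` is a hypothetical helper. The commands after the first are meant to run inside the cloned directory.

```python
import shlex


def fork_workflow(upstream_url, fork_url, branch, main="master"):
    """Return the git commands for one contribution via a personal fork."""
    return [
        # One-time setup: clone the project ("origin") and add the fork as a remote.
        ["git", "clone", upstream_url],
        ["git", "remote", "add", "fork", fork_url],
        # Per change: create a work branch, commit, and publish it to the fork.
        ["git", "checkout", "-b", branch],
        ["git", "commit", "-a"],
        ["git", "push", "fork", branch],
        # The pull request itself is opened on the hosting site, not via git.
        # After the merge: update from origin, then delete the work branch
        # locally and on the fork.
        ["git", "checkout", main],
        ["git", "pull", "origin", main],
        ["git", "branch", "-d", branch],
        ["git", "push", "fork", "--delete", branch],
    ]


def format_plan(commands):
    """Render the plan as shell-quoted lines for review."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands)
```

Printing `format_plan(fork_workflow(upstream, fork, "my-branch"))` shows nine separate commands for a single change, which is the overhead being described.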

I found that GitLab does #3 much better.
People can push their work branches directly to the project repo. You can set up the master branch (or any other branch) to require committer rights, so you can protect the master branch and don't have to force contributors into their own repositories.
The rest is roughly on par with GitHub. GitLab isn't as polished in all respects.

Oh, and the GitLab server-side code is available as Open Source. So if you want to set up your own servers, you can do that.
GitHub did not open source their server code. If GitHub goes down, projects need to migrate to a different Git host.

There is still plenty to figure out about how to manage this on github, but
before doing anything else I just wanted to shoot an email and see if folks
like this idea.

Feel free to ask, I have been through some of the setup pain.

Regards,
Jo

The test-suite is really weird relative to the rest of the LLVM project:
1) It contains all manner of crazily licensed code.
2) We don't really care about the history at all. Any concerns around linear
history or bisection are pretty much irrelevant.
3) We don't ever plan to have LLVM code move into or out from the test-suite
4) Its already big, and really should be much bigger. We shouldn't have
incentives to keep stuff out of the test suite because of size, hosting
cost, or anything else.

5) It could be used by other compilers / projects that are not LLVM related.
6) We could accept pull-requests from a much larger community
7) GitHub (or similar) can scale *A LOT* better than our
infrastructure, probably even use CDNs etc.

There is still plenty to figure out about how to manage this on github, but
before doing anything else I just wanted to shoot an email and see if folks
like this idea.

Maybe put LNT in there, too?

Some downsides:
* the separate administration of commit access for new developers,
but should be pretty low cost
* we'll have a full non-linear Git solution (no SVN behind) for some
projects, thus branches, merges, etc will be harder to tag for
releases.

I'd also avoid going for a full GitHub model (few people with commit
access, pull requests required), since this will be different from the current LLVM
model. But we can easily apply the current LLVM model to GitHub by any
committer accepting pull-requests from the wider community, just like
we commit for people without access today.

Overall, seems like a good choice to me. +1.

--renato

I don't think I've ever actually successfully run the test-suite and
fully understood what I was doing. As you say, it is kind of weird
relative to the LLVM project, but wouldn't moving it out of the
repository make it even more so? I wish it was instead more integrated
with the project, more useful, and better understood.

We branch, tag, and release test-suite as we do with the other modules
as part of the release process. That's maybe not super important for
most folks, but moving it to a separate repository would make that
process more complicated.

If the main motivation is 4), maybe we should consider moving the
whole repository to something that scales better?

Sorry if this is coming across as negative, but it just seems that the
most natural place for the LLVM test-suite is with the rest of LLVM,
so I don't see why we should move it without a good reason.

- Hans

I'm really in favor of a modular test-suite: separating the infrastructure from the individual suites as much as possible, and making it super easy and convenient to assemble modular suites together. We already have the mechanism for "external" test-suites; it would just need to be first-class.
I.e., I'd like to clone the test-suite, which would get me only the infrastructure but no tests to run, and then easily say: "clone these test-suites and make them available: llvm, Halide, etc.", which would clone them from separate repositories. Then I could list the available suites and do quick runs over chosen ones.
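
One way the "list the available suites and pick some" step could look, purely as a sketch: it assumes each separately cloned suite is a subdirectory containing a marker file. The marker name `suite.cfg` and both function names are hypothetical and not part of the existing test-suite.

```python
import os

MARKER = "suite.cfg"  # hypothetical marker file; not part of the real test-suite


def discover_suites(root):
    """Map suite name -> path for each immediate subdirectory holding MARKER."""
    suites = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isfile(os.path.join(path, MARKER)):
            suites[entry] = path
    return suites


def select_suites(available, wanted):
    """Return the paths of the requested suites, failing loudly on unknown names."""
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError("unknown suites: " + ", ".join(missing))
    return [available[name] for name in wanted]
```

With this shape, dropping a new suite directory next to the others is enough for it to show up in `discover_suites`, and the infrastructure itself never needs to know the list in advance.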

Just some thoughts...

I know what they write and I know what answer I got when I asked about
it for NetBSD's git mirror. I certainly know that I have a repository
larger than 1GB on github.

Joerg

It might be useful as an experiment.
If it goes well, the rest of LLVM could follow suit. If it does not go well, the test suite could be moved back.

Sites like GitHub lower the entry barrier into project contribution.

That would be *my* motives; I don't know the motives of the LLVM team.

Regards,
Jo

Dear Chandler,

First, can you articulate why you want to move the test suite to Github? Is it taking up too much space, or is there some other problem that you’re trying to solve? I think you clearly explain why moving the revision history isn’t necessary, but it’s not clear to me what problem you are trying to solve.

Well, I tried in my original email, but perhaps I should state the issue more generally.

The costs of us managing our own hosting of the test suite seem higher than for the rest of the project (size, scope, license diversity, etc), and yet the benefits of us managing our own hosting (compared to using a managed service like github) seem much lower.

It will also make checking out the test suite, especially as it grows, substantially faster.

And I really do think the test suite should grow, and grow a lot. I don’t think we should always run all of it, I actually think having good, focused slices of the test suite is really important (this has come up elsewhere on the thread). But I think we should also be in the business of making it easier to get more testing for LLVM. And one way to do that would be to move to a faster and cheaper (in maintenance/support terms) solution such as using well known managed hosting like github.

So ultimately, I guess I’m trying to clear a path for growth of the test suite (within reason) and reduce support burden on our common infrastructure.

Neither are really pressing problems, but they both seem worth addressing.

Second, if we move the revision history to Github, it would be nice to archive the existing Subversion history somewhere (e.g., leave it on llvm.org but disable commit access to it). The test suite has been used in numerous research papers, so keeping the revision history around is good practice. We should only delete the Subversion revision history if keeping it around is just too onerous.

Oh, I wouldn’t want to delete it. Your re-interpretation was correct; I just mean that a strict, linear, correlated flow of history common to the test suite and the compiler doesn’t seem important. Sorry for the confusion, I’ll follow up more on the history point on the relevant sub-thread.

Third, I assume your plan is to continue to track changes on Github. Is that correct?

Yep. I definitely wouldn’t want to see any real changes to process here, just a different “master” so-to-speak. But this also gets to the “there would be a ton of stuff to figure out if this is the right direction” issue. =] So sorry for the hand waving.

I don’t really care where the repository is located, but I do have some comments on the future test-suite directions:

Just as a meta-point, I don’t want to conflate any of this with a specific design direction. I’m really focused on “where is it hosted” as a simplifying thing for the project’s infrastructure.

Subject kinda says it all. Here is my rationale:

The test-suite is really weird relative to the rest of the LLVM project:

  1. It contains all manner of crazily licensed code.

That’s indeed a good reason to move the repository away.

  2. We don’t really care about the history at all. Any concerns around linear history or bisection are pretty much irrelevant.

We do care about the history. Sometimes benchmarks get fixed or tweaked, which may change the results; we should be able to dig into the history to see what happened when. In any case, retaining the history wouldn’t be a problem, would it?

See John’s response, and sorry for the broad statement.

I meant, we don’t care about a shared linear monotonic history with the rest of the compiler that we can bisect across simultaneously.

Clearly we still want version control!

  3. We don’t ever plan to have LLVM code move into or out from the test-suite

I could actually see moving LLVM code into the test-suite (we already use lit code from LLVM), but I don’t foresee moving code out of the test-suite into LLVM.

Well, I think it might make sense to separate the LLVM code used by the test suite from the test suite itself. I’d be happy to keep that code in the LLVM repository to the extent possible (perhaps with expanded stuff under utils/…).

  4. It’s already big, and really should be much bigger. We shouldn’t have incentives to keep stuff out of the test suite because of size, hosting cost, or anything else.

I agree with the goal of having a big test-suite. However I think there is a point where we should rather strive to have a stable base system for building and running tests, etc. and then have the actual benchmarks/tests being modules on top of that. We already have that situation today with External/SPEC* and I think it would be a good idea to have a mode where you just checkout more benchmarks into a test-suite subdirectory and they are automatically recognized and used (in fact that is something on my TODO list though at a very low position).

No argument from me about needing better organization, modularity and such. And we definitely need to have reasonably small slices that we really care about.

I think it’s useful, to the extent possible, to provide a common repository so that folks don’t have to aggregate too many things just so that we can have more productive discussions (“Well, where did you pull benchmark Whizzbang from? Oh, I have a different variant of it, so that’s why I don’t see that regression”).

But none of that should argue against better modularity and extensibility in it.

2) There is the cap on total repository size, which is in the neighborhood
of 1 GB [3]. A fresh checkout of test-suite clocks in at just over 3 GB. This
one actually is a problem.

This is not really true. Individual pack files must be below 1GB, but
the total repository can be much larger. That’s true even for the free
tiers. That said, it might be a good idea to split the repository into
modules to keep it manageable.

We can also handle this lazily – if and when it becomes a problem, we can re-factor the repo. I don’t think we need to overthink this.

Also, the Github folks might be willing to help. =]