GitHub Survey?

Folks,

It's the 1st of September, and we have neither the document nor the
survey ready. With the US meeting on 3-4 November, that leaves us only
2 months to do everything, and I'm not sure we'll be able to if we
delay much more.

Being the devil's advocate and hoping this doesn't spiral down
(again), there were a few pertinent questions left unanswered from the
previous threads...

1. How much vote, how much survey?

Most people are in agreement that a balance has to be struck, though
there is no clear consensus on how much of each.

The concerns raised were:
* As it is now, we may not have enough information on why, just what
people want.
* Without multiple choice questions, it'll be too hard to infer what
people want.
* With too many questions, statistical relevance will be diluted and
spread too thin.

FWIW, I'm personally happy with the current format. But this is not about me.

2. How much information do we want?

Chris' point on the other thread is that he wants a lot more
information, so we can derive all sorts of correlations and drive
other decisions, not just GitHub. I think this is a valid point, but
it depends on the time scale.

The concerns are:
* If we want to have enough meaningful results by the US LLVM, we
need to get it online ASAP
* It'll take months to converge on a long list of questions, publish,
get feedback, analyse it all
* The result will be more general, so we only do it once, but won't
help the GitHub choice
* Free text interpretation is fuzzy and easy to bias; concrete
decisions need concrete information

3. The current GitHub document

We need to update the current document to have both options:
sub-modules and mono-repo, with an exact description of what they
mean.

The sub-modules discussion has finished with a solid proposal, though
there aren't many example use cases, which makes it hard to
demonstrate. But the mono-repo discussion was still about which
repositories belong in the mono-repo and which don't, and what we
do about the others.

So, the next steps on this task are:
* Rename the file to GitHub.rst (since it'll have both proposals in there)
* Update the sub-modules part, adding example use cases
* Insert the whole mono-repo proposal, with what's in and out + examples
* Update the migration plan to cope with either option (should be
similar anyway)

== Next Steps

I'm attaching the pre-survey dump, with an anonymised CSV file (no
names or emails), plus a number of pictures of the pie charts. We'll
put it up in a web document to look nicer with the official survey.

So, the next steps are:

1. Make sure everyone is happy with the survey, and change it if necessary.
1.a Make sure the results are in a format that people expect / can work with.

2. Make the appropriate changes to the proposal document.
2.a Review, reach consensus, approve, publish.

3. Put the survey online, this time for real, deleting all current
responses first.
3.a *Everyone* that has filled it already will have to fill it again, sorry.

4. We'll need a deadline, so people feel compelled to answer sooner.
4.a When the deadline is reached, collate the results in a CSV file+PNGs.
4.b A group of people (volunteers? foundation?) will analyse the data
and decide how to present.
4.c I volunteer to write the final document, with their final blessing
before publishing.
4.d Hopefully before the US LLVM (4 Nov), we can have both the
proposal and survey results.

5. At the US LLVM, hold a BoF / Panel on the GitHub move.
5.a Use the proposal and survey results as a starting point.
5.b Gather the feedback, create a third document (I volunteer again).

6. Someone(s) take(s) the final decision...

cheers,
--renato

GitHubSurvey.zip (257 KB)

Hey Renato,

I want to again thank you for all your work herding this process along.

If I may for a moment I’d like to work backwards through your “Next Steps”. The last step is that someone (or a group of people) will make the final decision based on the information in the survey and the conversation in the BoF.

A lot of the earlier steps depend on who makes the decision and how. For example, if the intent is that the final decision be dictated by the result of the survey, it is less a survey and more a vote. On the other hand, if the intent is that a group of people interpret the survey results and make a decision based on them (although not dictated by them), then the survey should be worded differently.

It might make the process of constructing an appropriate survey easier if we first figure out who will be the decision makers, and how they are intended to make the decision.

My personal preference would be for the decision makers to be either a committee of developers or the LLVM Foundation board, and I would prefer if the survey were crafted to provide them with information to inform a decision, rather than a dictation of a decision.

-Chris

Hi Chris,

Those are very good points, and I agree with you.

In line with Chris Lattner's previous comments, that this is a purely
technical decision with far reaching consequences, I think the
foundation should continue its role as helpers and not decision
makers. However, we don't yet have a process of forming technical
committees for such cases, and given the delicate nature of this
decision (illustrated by the number of long and heated threads), it
won't be an easy process either. It's quite possible that the
discussions on how to form a representative committee will be as
heated or more so than the git discussion.

Picking code owners or top committers as a base would leave out a lot of
people that work hard on LLVM but never ended up owning a piece of
code, and would be saying that committing a lot of patches is somehow
better than a few well done and life changing ones. Maybe we do need
to start a parallel discussion on what this technical committee
would look like, so we can take such decisions more easily in the
future while still being representative. But none of that is fast
enough, I think.

The only easy solution I can see right now, is for Chris Lattner to
decide, based on both the survey and the BoF. I personally trust Chris
to take a decision that will be best for the community, but I
obviously don't speak for the whole community.

In that case, a voting-in-survey-clothing may not be a bad idea, as
it'll give us an idea of numbers. But we'll need more specific
questions and they can be slightly less statistically relevant, since
the decision process will not rely on them being representative.

Can you suggest a few additional questions to the survey? We had a lot
of feedback in the beginning, but the last one was several weeks ago.

I've added Chris as an editor, Tanya had access already. I don't want
to be the only editor in this, and it'd be good if they could change
it to suit the kind of information a committee/person would be looking
into.

cheers,
--renato

My personal preference would be for the decision makers to be either a
committee of developers or the LLVM Foundation board, and I would prefer if
the survey were crafted to provide them with information to inform a
decision, rather than a dictation of a decision.

Hi Chris,

Those are very good points, and I agree with you.

In line with Chris Lattner's previous comments, that this is a purely
technical decision with far reaching consequences, I think the
foundation should continue its role as helpers and not decision
makers. However, we don't yet have a process of forming technical
committees for such cases, and given the delicate nature of this
decision (illustrated by the number of long and heated threads), it
won't be an easy process either. It's quite possible that the
discussions on how to form a representative committee will be as
heated or more so than the git discussion.

Picking code owners or top committers as a base would leave out a lot of
people that work hard on LLVM but never ended up owning a piece of
code, and would be saying that committing a lot of patches is somehow
better than a few well done and life changing ones. Maybe we do need
to start a parallel discussion on what this technical committee
would look like, so we can take such decisions more easily in the
future while still being representative. But none of that is fast
enough, I think.

All of these points are totally valid, and I completely agree.

The only easy solution I can see right now, is for Chris Lattner to
decide, based on both the survey and the BoF. I personally trust Chris
to take a decision that will be best for the community, but I
obviously don't speak for the whole community.

I also would trust Chris in making a decision that would be best for the community.

In that case, a voting-in-survey-clothing may not be a bad idea, as
it'll give us an idea of numbers. But we'll need more specific
questions and they can be slightly less statistically relevant, since
the decision process will not rely on them being representative.

I think having the survey contain a question on which solution the respondent prefers is good, but I feel it is very limited. As someone who has been trying very hard to process both proposals and all the email threads, I have found the lack of data to drive the decision troubling. In particular, because this is, as you put it, “a purely technical decision with far reaching consequences”, I very strongly believe data is our friend in helping make the best long term decision.

Can you suggest a few additional questions to the survey? We had a lot
of feedback in the beginning, but the last one was several weeks ago.

Again, to work backwards, rather than starting with questions, let me suggest information that I think we could generate from a survey that would help inform the decision.

* How many users are already using Git as their primary way of interacting with LLVM sources?

In general I believe there are three possible decisions that could come from this. Either we go with one of the two proposals, or we change nothing. Knowing how many users are already using Git speaks to the impact of a transition from SVN to Git regardless of which Git proposal, and is a useful bit of data to collect.

* Which projects people are using, and which ones they contribute to.

Knowing this allows the decision maker(s) to weight results based on the opinions and impacts on contributors differently from the opinions and impacts on down-stream users. While I certainly don't think we should disregard downstream users, this decision will disproportionately impact contributors, so we need to take that into account.

In addition to that I believe we should actually provide a section of the survey specifically for questions that can inform the specific proposals, and improve them.

For example, the mono-repo proposal currently lacks firm details on a few things, which might benefit from survey answers: which projects should be included in the mono-repo, and are per-project git mirrors important?

-Chris

I think having the survey contain a question on which solution the respondent prefers is good, but I feel it is very limited.

Well, the current survey is more than just one question...

In general I believe there are three possible decisions that could come from this. Either we go with one of the two proposals, or we change nothing. Knowing how many users are already using Git speaks to the impact of a transition from SVN to Git regardless of which Git proposal, and is a useful bit of data to collect.

We have three specific questions as to what the impact of the
transition is (one for each option short term plus one for long term).
Why doesn't that cover your proposal?

Knowing this allows the decision maker(s) to weight results based on the opinions and impacts on contributors differently from the opinions and impacts on down-stream users. While I certainly don't think we should disregard downstream users, this decision will disproportionately impact contributors, so we need to take that into account.

We already have that *exact* question.

It's a multiple choice question where you check all projects that you
work on. The pie chart means nothing to the GitHub survey per se, but
the association of that information with the repo choice and the
impact will be very valuable.

For example, if most libc++ developers prefer option X while core LLVM
developers prefer Y.
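
As a rough illustration of that kind of association, here is a minimal sketch that cross-tabulates projects against the preferred option from the anonymised CSV. The file layout is an assumption, not the real export: first column a semicolon-separated list of projects, second column the preferred option, one header row, and no quoted commas. `crosstab` is a hypothetical helper name.

```shell
# Hypothetical helper: count (project, preference) pairs in a survey CSV.
# Assumed layout: column 1 = "llvm;clang;...", column 2 = preferred option.
crosstab() {
  awk -F, 'NR > 1 {
      # A respondent who ticked several projects counts once per project.
      n = split($1, projects, ";")
      for (i = 1; i <= n; i++) count[projects[i] "," $2]++
  }
  END { for (key in count) print key "," count[key] }' "$1" | LC_ALL=C sort
}
```

Running `crosstab responses.csv` (file name made up) would print one `project,preference,count` line per combination, which is enough to spot a split such as libc++ developers leaning one way and core LLVM developers the other. A real export with quoted fields would need a proper CSV parser instead of `-F,`.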

What would you suggest in addition to what's already there?

In addition to that I believe we should actually provide a section of the survey specifically for questions that can inform the specific proposals, and improve them.

Right. I steered away from that on purpose, but I guess we can add one
free text field for each impact question, and mention that people
should lay out why this or that option is not good, and how to make
it good, in the case where it's chosen.

For example, the mono-repo proposal currently lacks firm details on a few things, which might benefit from survey answers. For example, which projects should be included in the mono-repo, or are per-project git mirrors important.

Indeed, this is currently missing. Would that be enough to cover this
in the free-text I mention above?

cheers,
--renato

I think having the survey contain a question on which solution the respondent prefers is good, but I feel it is very limited.

Well, the current survey is more than just one question…

Sorry, I didn’t mean to imply that was the only question.

In general I believe there are three possible decisions that could come from this. Either we go with one of the two proposals, or we change nothing. Knowing how many users are already using Git speaks to the impact of a transition from SVN to Git regardless of which Git proposal, and is a useful bit of data to collect.

We have three specific questions as to what the impact of the
transition is (one for each option short term plus one for long term).
Why doesn't that cover your proposal?

There is a difference between asking “How much does this impact you?” versus getting data on what about it causes impact. For example, someone who thinks a transition will greatly impact them but is already using git may have specific interesting concerns.

Knowing this allows the decision maker(s) to weight results based on the opinions and impacts on contributors differently from the opinions and impacts on down-stream users. While I certainly don't think we should disregard downstream users, this decision will disproportionately impact contributors, so we need to take that into account.

We already have that *exact* question.

It's a multiple choice question where you check all projects that you
work on. The pie chart means nothing to the GitHub survey per se, but
the association of that information with the repo choice and the
impact will be very valuable.

For example, if most libc++ developers prefer option X while core LLVM
developers prefer Y.

What would you suggest in addition to what's already there?

Where I think the current survey is lacking is the ability to differentiate uses from contributions. Probably the only thing we need to add to the survey to cover this is either a clear statement that the email provided should be the email tied to their SVN account, or a request for SVN username if the person has commit access.

In addition to that I believe we should actually provide a section of the survey specifically for questions that can inform the specific proposals, and improve them.

Right. I steered away from that on purpose, but I guess we can add one
free text field for each impact question, and mention that people
should lay out why this or that option is not good, and how to make
it good, in the case where it's chosen.

For example, the mono-repo proposal currently lacks firm details on a few things, which might benefit from survey answers. For example, which projects should be included in the mono-repo, or are per-project git mirrors important.

Indeed, this is currently missing. Would that be enough to cover this
in the free-text I mention above?

As I’ve said in the past, analyzing data from free text fields will be unwieldy. By my count, we had 473 contributors across clang, clang-tools-extra, compiler-rt, libcxx, libcxxabi, libunwind, lld, lldb, and llvm in the last year*. In an ideal world we’d get 100% response to the survey plus additional responses from downstream users who aren’t contributors. Analyzing free-text fields for hundreds of respondents to get data that could come from very simple questions seems less than ideal.

All that aside, none of this may matter. It largely depends on who is making the decision and how they are making the decision. It might be nice to get some consensus around that as a starting point.

-Chris

*Data gathered by running:
for repo in $(find . -name .git); do git --git-dir=$repo log --format=%an --since='1 year ago'; done | sort | uniq -c | wc -l
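
The one-liner above counts distinct author names. A slightly more defensive variant is sketched below; it is not the command that produced the 473 figure. It assumes that deduplicating on lower-cased author email is closer to "one contributor" than the author name, which may not hold for git-svn mirrors, and `count_authors` is a hypothetical helper name.

```shell
# Hypothetical helper: count distinct commit authors across all Git
# checkouts below a root directory.
# $1 = root directory (default: .), $2 = --since value (default: 1 year ago)
count_authors() {
  root=${1:-.}
  since=${2:-1 year ago}
  # Only real .git directories are visited; .git files (worktrees,
  # submodules) are skipped by -type d.
  find "$root" -name .git -type d | while IFS= read -r gitdir; do
    git --git-dir="$gitdir" log --format=%ae --since="$since"
  done | tr '[:upper:]' '[:lower:]' | sort -u | awk 'END { print NR }'
}
```

Reading the `find` output line by line keeps paths with spaces intact, and `sort -u` replaces the `sort | uniq -c` pair since only the number of distinct entries matters.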

There is a difference between asking “How much does this impact you?” versus getting data on what about it causes impact. For example, someone who thinks a transition will greatly impact them but is already using git may have specific interesting concerns.

Indeed, I think we can elaborate on the description of the text areas.

Where I think the current survey is lacking is the ability to differentiate uses from contributions. Probably the only thing we need to add to the survey to cover this is either a clear statement that the email provided should be the email tied to their SVN account, or a request for SVN username if the person has commit access.

There is a question on what kind of contribution you have (core dev,
user, interested party, etc).

As I’ve said in the past analyzing data from free text fields will be unwieldy. By my count, we had 473 contributors across clang, clang-tools-extra, compiler-rt, libcxx, libcxxabi, libunwind, lld, lldb, and llvm in the last year*. In an ideal world we’d get 100% response to the survey plus additional responses from downstream users who aren’t contributors. Analyzing free-text fields for hundreds of respondents to get data that could come from very simple questions seems less than ideal.

We also have non-committers that work on LLVM every day, and some of
those people's work is probably even more relevant (infrastructure,
release, products, validation, development environment, etc). However,
I don't expect to get 473 responses, but something between 100 and
200, which already would be fantastic.

All that aside, none of this may matter. It largely depends on who is making the decision and how they are making the decision. It might be nice to get some consensus around that as a starting point.

True, but we can't bet on that. :)

Can you write up what description would be good for the free-text
questions around the "impact"? I think all the other concerns are
already covered by the current fields.

cheers,
--renato

Folks,

After feedback from Chris and Mehdi, I have added one long text answer
to *each* critical question (impact on productivity), so that people
can extend their reasoning.

But I have not made them compulsory, so that people who don't know
much about it, or don't have any problem, don't feel compelled to
expand on nothing.

https://docs.google.com/forms/d/e/1FAIpQLSc2PBeHW-meULpCOpmbGK1yb2qX8yzcQBtT4nqNF05vSv69WA/viewform

The bottom line is, answers with more substance will be taken more
seriously than those without, but the numbers should also tell us
something.

For example, if 9/10 of the answers are "we don't care", then the
other 1/10 will have less weight than if it were 1/2, but will still be
important to any decision.

Can we go live with what we have now?

Mehdi, how's the document? Can we get that online so I can change the
header of the survey?

We only have a month and a bit now...

cheers,
--renato

Hi Renato,

Thanks very much for putting this together.

I think the proposal document is almost finished now. Since I ended up reviewing it pretty thoroughly, I've gained a bit of understanding about the concerns we need input on.

The survey is a great start, but the final page isn't quite addressing the concerns in the final proposal. I'm not sure it asks the right questions to focus the conversation at the BoF.

Firstly, I'm concerned that the questions focus on:
- what's good for the individuals responding, instead of what they think is best for the LLVM project; and
- how much pain the transition would cause, instead of what they think the right final state is.

Secondly, I'm worried about this question: "How does the choice between a single repository with all projects and the use of sub-modules impact your usage of Git?" I'm not sure we'll get a good signal from this; it's essentially a vote on the two variants, but it doesn't force the respondent to think about the specific issues. I'd rather find a way to ask about the specific concerns raised in the document.

Thirdly, I'm worried that the follow-ups talk about "preferred" and "non-preferred" instead of "multirepo" and "monorepo". This makes data-mining non-trivial (because the meaning depends on previous answers) and increases the chance of respondent confusion.

I spent some time today thinking through what set of questions would get us the data we want.
- I've focused on the main concerns about (and benefits of) the two variants.
- I've referred to the multirepo and monorepo by name (consistently) in questions asking about them. This ensures that people know exactly what they're answering.
- I've added specific questions about how people plan to use the multirepo and monorepo, so that we know which tooling is most important (and also to determine how worried we should be about some of the concerns).
- I've moved the "vote"-like question to the end to force respondents to think through the issues first. I've also restricted "the vote" to "the ideal project setup", so we can clearly separate that from "transition pain". (I'm still not sure the vote will have much value, but it doesn't hurt.)

Here are my suggested questions; feedback welcome:

Screen Shot 2016-10-12 at 16.16.05.png.pdf (60.3 KB)

> 6. How important is cross-project blame, grep, etc.?

I don’t understand “cross-project blame” as it works on one file at a time?

Also, is this question intended to cover bisection?

Thanks,

--paulr

> 6. How important is cross-project blame, grep, etc.?
I don't understand "cross-project blame" as it works on one file at a time?

True, not straightforward blame.
My workflow when trying to track the history of some code involves frequently blaming recursively.
When I reach a commit that moved the function, I switch file, but continue *from the same commit*. This is only possible in the same repository.
Another thing I use daily is finding “when was this added”, using git “pickaxe” (git log -S ….). Which again works only inside a repository.
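
The recursive-blame workflow described above can be sketched on a throwaway repository. Everything here is made up for illustration: the file names, the function, and the `blamed_commit` helper. Plain `git blame` stops at the commit that moved the function into the new file, and the search then resumes on the old file from the parent of that commit, which only works because both files live in the same repository.

```shell
# Scratch repository; file names and contents are made up for this demo.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

printf 'int helper(void) { return 1; }\nint other(void) { return 2; }\n' > "$repo/a.c"
printf 'int main(void) { return 0; }\n' > "$repo/b.c"
git -C "$repo" add a.c b.c
git -C "$repo" commit -q -m "Add helper"

# Move helper() from a.c to b.c in a single commit.
printf 'int other(void) { return 2; }\n' > "$repo/a.c"
printf 'int main(void) { return 0; }\nint helper(void) { return 1; }\n' > "$repo/b.c"
git -C "$repo" commit -q -am "Move helper to b.c"

# Hypothetical helper: hash of the commit blamed for the first line
# matching a regex. $1 = revision, $2 = file, $3 = regex
blamed_commit() {
  git -C "$repo" blame --porcelain -L "/$3/,+1" "$1" -- "$2" | head -n 1 | cut -d' ' -f1
}

move=$(blamed_commit HEAD b.c 'int helper')        # blame stops at the move
origin=$(blamed_commit "$move^" a.c 'int helper')  # resume in the old file

git -C "$repo" log -1 --format='stops at: %s' "$move"
git -C "$repo" log -1 --format='origin:   %s' "$origin"
```

`git blame -C` can detect some of these moves automatically, but only between files changed in the same commit, so it has the same single-repository requirement.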

Hi,

Thanks a lot Duncan, I really like this! I totally support adopting this scheme now. See inline a few quite minor comments.

Renato: are you still interested and available now to set-up the survey? We should close on this this week.

Folks,

After feedback from Chris and Mehdi, I have added one long text answer
to each critical question (impact on productivity), so that people
can extend their reasoning.

But I have not made them compulsory, so that people who don’t know
much about it, or don’t have any problem, don’t feel compelled to
expand on nothing.

https://docs.google.com/forms/d/e/1FAIpQLSc2PBeHW-meULpCOpmbGK1yb2qX8yzcQBtT4nqNF05vSv69WA/viewform

The bottom line is, answers with more substance will be taken more
seriously than those without, but the numbers should also tell us
something.

For example, if 9/10 of the answers are “we don’t care”, then the
other 1/10 will have less weight than if it were 1/2, but will still be
important to any decision.

Can we go live with what we have now?

Mehdi, how’s the document? Can we get that online so I can change the
header of the survey?

Hi Renato,

Thanks very much for putting this together.

I think the proposal document is almost finished now. Since I ended up reviewing it pretty thoroughly, I’ve gained a bit of understanding about the concerns we need input on.

The survey is a great start, but the final page isn’t quite addressing the concerns in the final proposal. I’m not sure it asks the right questions to focus the conversation at the BoF.

Firstly, I’m concerned that the questions focus on:

  • what’s good for the individuals responding, instead of what they think is best for the LLVM project; and
  • how much pain the transition would cause, instead of what they think the right final state is.

Secondly, I’m worried about this question: “How does the choice between a single repository with all projects and the use of sub-modules impact your usage of Git?” I’m not sure we’ll get a good signal from this; it’s essentially a vote on the two variants, but it doesn’t force the respondent to think about the specific issues. I’d rather find a way to ask about the specific concerns raised in the document.

Thirdly, I’m worried that the follow-ups talk about “preferred” and “non-preferred” instead of “multirepo” and “monorepo”. This makes data-mining non-trivial (because the meaning depends on previous answers) and increases the chance of respondent confusion.

I spent some time today thinking through what set of questions would get us the data we want.

  • I’ve focused on the main concerns about (and benefits of) the two variants.
  • I’ve referred to the multirepo and monorepo by name (consistently) in questions asking about them. This ensures that people know exactly what they’re answering.
  • I’ve added specific questions about how people plan to use the multirepo and monorepo, so that we know which tooling is most important (and also to determine how worried we should be about some of the concerns).
  • I’ve moved the “vote”-like question to the end to force respondents to think through the issues first. I’ve also restricted “the vote” to “the ideal project setup”, so we can clearly separate that from “transition pain”. (I’m still not sure the vote will have much value, but it doesn’t hurt.)

Here are my suggested questions; feedback welcome:


  1. How do you use LLVM?
    // as is

  2. Which projects do you contribute to / use?
    // as is

For this last question, I’d keep a check-box list with an optional blank field. Since the list of projects is limited, checking boxes is easier both for the people answering and for us doing the data-mining.
(What Renato has set up looks good to me here.)

  3. Use this field to expand on your usage, if necessary
    // as is
  4. How often do you work on a small LLVM sub-project without using a checkout of LLVM itself?
    • Always.
    • Most of the time.
    • Sometimes.
    • Never.
  5. Please categorize how you interact with upstream.
    • I need read/write access, and I have limited disk space.
    • I need read/write access, but a 1GB clone doesn’t scare me.
    • I only need read access.
  6. How important is cross-project blame, grep, etc.?
    • Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
    • Extremely. It should be easy enough that everyone does it by default.
    • Somewhat. I would use it if it were easy, but it’s just nice to have.
    • Not at all. Anyone who cares can write their own tooling.
  7. Single-commit cross-project refactoring designs away a class of build failures and simplifies making API changes. How important is it?
    • Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
    • Extremely. It should be easy enough that everyone does it by default.
    • Somewhat. I would use it if it were easy, but it’s just nice to have.
    • Not at all. Anyone who cares can write their own tooling.
  8. The multirepo variant provides a read-only umbrella repository to coordinate commits between the split sub-project repositories using Git submodules. Assuming multirepo gets adopted, how do you expect to use the umbrella?
    // checkboxes:
    • Actively contribute tooling improvements to improve it.
    • Integrate it into our downstream fork.
    • Use it for upstream contributions.
    • Use it as the primary interface development environment.
    • Use it for bisection.
  9. If multirepo is adopted, how do you plan to contribute to upstream?
    • Using Git submodules.
    • Using the Git repos directly.
    • Using the SVN bridges.
    • n/a: I don’t contribute.

Can you clarify what “Using Git submodules” means?
Since the umbrella is read-only, it is not clear to me.
On the other hand, removing this first answer and keeping the others makes sense to me.

  10. The monorepo variant provides read/write access to sub-projects via an SVN bridge and git-svn. Contributors will have the option to continue using repositories split on project boundaries. Assuming monorepo gets adopted, how do you plan to contribute?
    • I’ll use the monorepo as soon as it’s possible, even before it’s canonical.
    • I’ll use the monorepo as soon as it’s canonical.
    • I’ll transition to monorepo eventually.
    • I’ll use the SVN bridge on separated sub-projects forever.
    • I’ll use a Git mirror (and/or git-svn) on separated sub-projects forever.
    • n/a: I don’t contribute.
  11. If monorepo is adopted, how do you plan to integrate it downstream?
    • We already use monorepo.
    • We’ll switch to pulling from monorepo during the transition period.
    • We’ll switch to pulling from monorepo eventually.
    • We’ll integrate from the SVN bridge forever.
    • We’ll integrate from the split sub-project Git mirror forever.
    • n/a: There is no downstream.
  12. The multi/mono hybrid variant merges some sub-projects, but leaves runtimes in separate repositories using the umbrella to tie them together. Is this the best or worst of both worlds?
    • This is great. Native cross-project refactoring, without penalizing runtime-only developers.
    • Whatever. I’ll deal with it.
    • This is terrible. All the transition pain of monorepo, without the advantages.
  13. If multirepo is adopted, how much pain will there be in your transition?
    • Nothing consequential.
    • A little; but it’ll be fine.
    • A lot; but it’ll get done somehow.
    • Too much; I/we may stop contributing to LLVM.
  14. If monorepo is adopted, how much pain will there be in your transition?
    • Nothing consequential.
    • A little; but it’ll be fine.
    • A lot; but it’ll get done somehow.
    • Too much; I/we may stop contributing to LLVM.

For the two previous questions, I’d add an answer “n/a: I don’t contribute”.
(We keep read-only views in both cases, so it is not clear to me how it affects non-contributors.)

When I reach a commit that moved the function, I switch file, but continue
*from the same commit*. This is only possible in the same repository.

Due to their physical separation, I don't think people have been
moving code between projects in that manner.

If we keep the separation (sub-mods), is it possible to move files
between sub-modules?

If we do mono-repo, then the problem goes away.

Another thing I use daily is finding “when was this added”, using git
“pickaxe” (git log -S ….). Which again works only inside a repository.

Uh, that looks nice. Can you share your alias? :)

cheers,
--renato

> 6. How important is cross-project blame, grep, etc.?

I don't understand "cross-project blame" as it works on one file at a time?

True, not straightforward blame.
My workflow when trying to track the history of some code involves frequently blaming recursively.
When I reach a commit that moved the function, I switch file, but continue *from the same commit*. This is only possible in the same repository.
Another thing I use daily is finding “when was this added”, using git “pickaxe” (git log -S ….). Which again works only inside a repository.

Maybe it makes sense to leave blame out of it, since it would be awkward to explain here? Happy with suggested wording either way (probably at least move "grep" to the front of the list, since that one is obvious).

Also, is this question intended to cover bisection?

Not exactly, but maybe a little. It's also partially covered by the checkbox in #8.

Note that bisection will work pretty similarly between the variants (assuming write-once tooling is in place for the umbrella). Just now I tried to come up with wording specifically for bisection (either in the style of #7, or using checkboxes to sort out which differences matter), but failed to come up with something that (a) we'd get real data from and (b) wasn't already covered by other questions about tooling differences.

If you have a suggestion, that would be great.

When I reach a commit that moved the function, I switch file, but continue
from the same commit. This is only possible in the same repository.

Due to their physical separation, I don’t think people have been
moving code between projects in that manner.

Most people are not doing this.

If we keep the separation (sub-mods), is it possible to move files
between sub-modules?

Yes, but not with a single commit: it is “deleting code / file in repo X”, and then “adding code / file in repo Y”.
So the git log in repo Y stops at some point, where the commit message says “moving feature F from repo X”.
And then you need to clone repo X, and use “git log” to find when it was taken out, and resume your blame from there.
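
To illustrate the difference, here is a minimal Python sketch of tracing a piece of code back to where it was first added. It is a toy model under stated assumptions: the log format, the commit ids, the paths and the `trace_origin` helper are all hypothetical, and the explicit "rename" entry stands in for git's rename detection (git does not record renames, it infers them from similarity).

```python
def trace_origin(log, path):
    """Walk `log` (newest first) and follow `path` back through renames.

    Each entry is (commit_id, action, path, old_path). Returns the commit
    that first added the code and the path it had then, as far as this
    single log can see.
    """
    for commit_id, action, new_path, old_path in log:
        if new_path != path:
            continue
        if action == "rename":
            path = old_path  # keep walking under the old name
        elif action == "add":
            return commit_id, path
    return None, path


# Single repository: the move is one commit, so the trail continues.
mono = [
    ("m3", "rename", "clang/lib/F.cpp", "llvm/lib/F.cpp"),
    ("m1", "add", "llvm/lib/F.cpp", None),
]

# Split repositories: the move is a delete in X plus an add in Y.
repo_y = [("y7", "add", "lib/F.cpp", None)]
repo_x = [
    ("x9", "delete", "lib/F.cpp", None),
    ("x2", "add", "lib/F.cpp", None),
]

print(trace_origin(mono, "clang/lib/F.cpp"))  # reaches the original add
print(trace_origin(repo_y, "lib/F.cpp"))      # stops at the "moved from X" commit
print(trace_origin(repo_x, "lib/F.cpp"))      # a second, manual search in X
```

In the split case the first search ends at the commit that imported the code into repo Y, and the real origin only shows up after a separate search in repo X.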

If we do mono-repo, then the problem goes away.

Another thing I use daily is finding “when was this added”, using git
“pickaxe” (git log -S ….). Which again works only inside a repository.

Uh, that looks nice. Can you share your alias? :slight_smile:

No alias, standard git: git log -S “something”

One description I found online (http://jfire.io/blog/2012/03/07/code-archaeology-with-git/ ) is:

"Pickaxes are often useful for archaeological purposes, and git’s pickaxe is no exception. It refers to the -S option to git log. The -S option takes a string parameter and searches the commit history for commits that introduce or remove that string. That’s not quite the same thing as searching for commits whose diff contains the string—the change must actually add or delete that string, not simply include a line on which it appears.”
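
To make that distinction concrete, here is a minimal Python sketch of the rule: a commit matches only when it changes how many times the string occurs. This is a model of the behaviour, not git's implementation; `pickaxe_hits` and the sample history are hypothetical, and the sketch tracks a single file for simplicity.

```python
def pickaxe_hits(history, needle):
    """Return ids of commits that change how many times `needle` occurs.

    `history` is a list of (commit_id, file_text) pairs, oldest first.
    """
    hits = []
    previous_text = ""  # the file does not exist before the first commit
    for commit_id, text in history:
        if text.count(needle) != previous_text.count(needle):
            hits.append(commit_id)
        previous_text = text
    return hits


history = [
    ("c1", "int foo();\n"),
    ("c2", "int foo();\nint bar();\n"),   # adds bar
    ("c3", "int foo();\nlong bar();\n"),  # edits the line, bar is still there
    ("c4", "int foo();\n"),               # removes bar
]

print(pickaxe_hits(history, "bar"))
```

Commit c3 touches a line containing "bar" but neither adds nor removes an occurrence, so it is not reported; only c2 and c4 are.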

Hi Duncan,

I don't understand your concerns.

First, the choice between sub-modules and mono-repo has been put
forward as the only two choices because people felt that, if we let it
open, we'd have too many different implementation details and we'd
never get anywhere.

So...

- how much pain the transition would cause, instead of what they think the right final state is.

The final state is defined by submod vs. monorepo, and that's
represented in a different question. Those questions are addressing
the additional work done to get there, as many have said would be the
crucial decision point.

It also outlines the cost over their preferred vs non-preferred
solutions, which leads to the aggregated cost over the whole project
for each decision.

- what's good for the individuals responding, instead of what they think is best for the LLVM project; and

That's implied. I think it is clear enough, but we can always change
the wording if others feel confused.

Secondly, I'm worried about this question: "How does the choice between a single repository with all projects and the use of sub-modules impact your usage of Git?" I'm not sure we'll get good signal from this; it's essentially a vote on the two variants, but it doesn't force the respondent to think about the specific issues. I'd rather find a way to ask about the specific concerns raised in the document.

It is a vote. The "thinking" is on the extended answer that follows.
Answers with good extended reasoning will have a greater weight than
those without.

If you're worried about data mining, then leaving those questions to
full text answers will require someone to read it all, interpret, and
put their bias on top. Given the nature of this problem, we should
avoid bias whenever possible, especially when interpreting the
answers.

Thirdly, I'm worried that the follow-ups talk about "preferred" and "non-preferred" instead of "multirepo" and "monorepo". This makes data-mining non-trivial (because the meaning depends on previous answers) and increases the chance of respondent confusion.

I see your point. We can re-word to make that more clear.

4. How often do you work on a small LLVM sub-project without using a checkout of LLVM itself?
- Always.
- Most of the time.
- Sometimes.
- Never.

Interesting, it covers the main problem with both proposals.

5. Please categorize how you interact with upstream.
- I need read/write access, and I have limited disk space.
- I need read/write access, but a 1GB clone doesn't scare me.
- I only need read access.

I'm not sure that's critical. My current source repo has 35GB with
just a few worktrees.

Also, both solutions have low-disk-usage modes, and this would make no
difference on how we proceed.

6. How important is cross-project blame, grep, etc.?
- Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
- Extremely. It should be easy enough that everyone does it by default.
- Somewhat. I would use it if it were easy, but it's just nice to have.
- Not at all. Anyone who cares can write their own tooling.

Based on other comments in the thread, we should leave this one out.

7. Single-commit cross-project refactoring designs away a class of build failures and simplifies making API changes. How important is it?
- Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
- Extremely. It should be easy enough that everyone does it by default.
- Somewhat. I would use it if it were easy, but it's just nice to have.
- Not at all. Anyone who cares can write their own tooling.

I don't like to assert my opinion and then ask how much people agree.
I prefer to ask the question directly, like:

How often do you need to commit across repositories (ex. llvm+clang)
and how often are your builds broken because they're in separate
repositories?

Also, I think your scale of importance is somewhat skewed up. Vital and
Extremely are at the top, somewhat is right bang in the middle and not
at all is the very bottom.

You either have two positive and two negative (very, somewhat, not
much, not at all) or you add a fifth in the middle. I prefer 4 because
that makes people think harder.

8. The multirepo variant provides a read-only umbrella repository to coordinate commits between the split sub-project repositories using Git submodules. Assuming multirepo gets adopted, how do you expect to use the umbrella?
// checkboxes:
+ Actively contribute tooling improvements to improve it.
+ Integrate it into our downstream fork.
+ Use it for upstream contributions.
+ Use it as the primary interface development environment.
+ Use it for bisection.

Good. (+ N/A, too)

9. If multirepo is adopted, how do you plan to contribute to upstream?
- Using Git submodules.
- Using the Git repos directly.
- Using the SVN bridges.
- n/a: I don't contribute.

10. The monorepo variant provides read/write access to sub-projects via an SVN bridge and git-svn. Contributors will have the option to continue using repositories split on project boundaries. Assuming monorepo gets adopted, how do you plan to contribute?
- I'll use the monorepo as soon as it's possible, even before it's canonical.
- I'll use the monorepo as soon as it's canonical.
- I'll transition to monorepo eventually.
- I'll use the SVN bridge on separated sub-projects forever.
- I'll use a Git mirror (and/or git-svn) on separated sub-projects forever.
- n/a: I don't contribute.

11. If monorepo is adopted, how do you plan to integrate it downstream?
- We already use monorepo.
- We'll switch to pulling from monorepo during the transition period.
- We'll switch to pulling from monorepo eventually.
- We'll integrate from the SVN bridge forever.
- We'll integrate from the split sub-project Git mirror forever.
- n/a: There is no downstream.

Good.

12. The multi/mono hybrid variant merges some sub-projects, but leaves runtimes in separate repositories using the umbrella to tie them together. Is this the best or worst of both worlds?
- This is great. Native cross-project refactoring, without penalizing runtime-only developers.
- Whatever. I'll deal with it.
- This is terrible. All the transition pain of monorepo, without the advantages.

I didn't know we were proposing yet another variant. This seems like a
last minute rushed in proposal and I don't want to endorse it in the
survey. We can discuss them in the BoF, though.

13. If multirepo is adopted, how much pain will there be in your transition?
- Nothing consequential.
- A little; but it'll be fine.
- A lot; but it'll get done somehow.
- Too much; I/we may stop contributing to LLVM.

14. If monorepo is adopted, how much pain will there be in your transition?
- Nothing consequential.
- A little; but it'll be fine.
- A lot; but it'll get done somehow.
- Too much; I/we may stop contributing to LLVM.

Those are already covered by the current bad/good, but I'll change the
wording to be like this one.

15. If we could go back in time and restart the project with today's technologies, which repository scheme would be best for the LLVM project?
- CVS.
- Subversion repository with split sub-projects (<sub-project>/trunk), with git-svn.
- Subversion repository as a single project (trunk/<sub-project>), with git-svn.
- Git: multirepo variant.
- Git: monorepo variant.
- Git: multi/mono hybrid variant.
- Other.

Let's not put CVS in there, please. :slight_smile:

So, what's the purpose of this question? I mean, we are "starting
fresh" in a way, and the responses of the rest of the survey would
make this question irrelevant, no?

I'll be changing the wording on the ones we all agree on and leave the
ones with questions until they're all solved.

cheers,
--renato

Hi Duncan,

I don't understand your concerns.

First, the choice between sub-modules and mono-repo has been put
forward as the only two choices because people felt that, if we let it
open, we'd have too many different implementation details and we'd
never get anywhere.

So...

- how much pain the transition would cause, instead of what they think the right final state is.

The final state is defined by submod vs. monorepo, and that's
represented in a different question. Those questions are addressing
the additional work done to get there, as many have said would be the
crucial decision point.

It also outlines the cost over their preferred vs non-preferred
solutions, which leads to the aggregated cost over the whole project
for each decision.

- what's good for the individuals responding, instead of what they think is best for the LLVM project; and

That's implied. I think it is clear enough, but we can always change
the wording if others feel confused.

Secondly, I'm worried about this question: "How does the choice between a single repository with all projects and the use of sub-modules impact your usage of Git?" I'm not sure we'll get good signal from this; it's essentially a vote on the two variants, but it doesn't force the respondent to think about the specific issues. I'd rather find a way to ask about the specific concerns raised in the document.

It is a vote. The "thinking" is on the extended answer that follows.
Answers with good extended reasoning will have a greater weight than
those without.

If you're worried about data mining, then leaving those questions to
full text answers will require someone to read it all, interpret, and
put their bias on top. Given the nature of this problem, we should
avoid bias whenever possible, especially when interpreting the
answers.

Thirdly, I'm worried that the follow-ups talk about "preferred" and "non-preferred" instead of "multirepo" and "monorepo". This makes data-mining non-trivial (because the meaning depends on previous answers) and increases the chance of respondent confusion.

I see your point. We can re-word to make that more clear.

4. How often do you work on a small LLVM sub-project without using a checkout of LLVM itself?
- Always.
- Most of the time.
- Sometimes.
- Never.

Interesting, it covers the main problem with both proposals.

5. Please categorize how you interact with upstream.
- I need read/write access, and I have limited disk space.
- I need read/write access, but a 1GB clone doesn't scare me.
- I only need read access.

I'm not sure that's critical. My current source repo has 35GB with
just a few worktrees.

Also, both solutions have low-disk-usage modes, and this would make no
difference on how we proceed.

This is a point of contention and a concern that Chris voiced about the monorepo. It should be in the survey.

6. How important is cross-project blame, grep, etc.?
- Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
- Extremely. It should be easy enough that everyone does it by default.
- Somewhat. I would use it if it were easy, but it's just nice to have.
- Not at all. Anyone who cares can write their own tooling.

Based on other comments in the thread, we should leave this one out.

The point of the survey is to gather data. The fact that not many people are doing it does not mean that, after reading the proposal document, they wouldn’t answer “It should be easy enough that everyone does it by default.”

7. Single-commit cross-project refactoring designs away a class of build failures and simplifies making API changes. How important is it?
- Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
- Extremely. It should be easy enough that everyone does it by default.
- Somewhat. I would use it if it were easy, but it's just nice to have.
- Not at all. Anyone who cares can write their own tooling.

I don't like to assert my opinion and then ask how much people agree.

I don’t see an “opinion” in the question.

I prefer to ask the question directly, like:

How often do you need to commit across repositories (ex. llvm+clang)
and how often are your builds broken because they're in separate
repositories?

Asking it this way does not allow someone to answer “It should be easy enough that everyone does it by default.”

I prefer Duncan’s wording.

Also, I think your scale of importance is somewhat skewed up. Vital and
Extremely are at the top, somewhat is right bang in the middle and not
at all is the very bottom.

You either have two positive and two negative (very, somewhat, not
much, not at all) or you add a fifth in the middle. I prefer 4 because
that makes people think harder.

8. The multirepo variant provides a read-only umbrella repository to coordinate commits between the split sub-project repositories using Git submodules. Assuming multirepo gets adopted, how do you expect to use the umbrella?
// checkboxes:
+ Actively contribute tooling improvements to improve it.
+ Integrate it into our downstream fork.
+ Use it for upstream contributions.
+ Use it as the primary interface development environment.
+ Use it for bisection.

Good. (+ N/A, too)

9. If multirepo is adopted, how do you plan to contribute to upstream?
- Using Git submodules.
- Using the Git repos directly.
- Using the SVN bridges.
- n/a: I don't contribute.

10. The monorepo variant provides read/write access to sub-projects via an SVN bridge and git-svn. Contributors will have the option to continue using repositories split on project boundaries. Assuming monorepo gets adopted, how do you plan to contribute?
- I'll use the monorepo as soon as it's possible, even before it's canonical.
- I'll use the monorepo as soon as it's canonical.
- I'll transition to monorepo eventually.
- I'll use the SVN bridge on separated sub-projects forever.
- I'll use a Git mirror (and/or git-svn) on separated sub-projects forever.
- n/a: I don't contribute.

11. If monorepo is adopted, how do you plan to integrate it downstream?
- We already use monorepo.
- We'll switch to pulling from monorepo during the transition period.
- We'll switch to pulling from monorepo eventually.
- We'll integrate from the SVN bridge forever.
- We'll integrate from the split sub-project Git mirror forever.
- n/a: There is no downstream.

Good.

12. The multi/mono hybrid variant merges some sub-projects, but leaves runtimes in separate repositories using the umbrella to tie them together. Is this the best or worst of both worlds?
- This is great. Native cross-project refactoring, without penalizing runtime-only developers.
- Whatever. I'll deal with it.
- This is terrible. All the transition pain of monorepo, without the advantages.

I didn't know we were proposing yet another variant. This seems like a
last minute rushed in proposal and I don't want to endorse it in the
survey. We can discuss them in the BoF, though.

We're not “endorsing” anything in the survey. We’re collecting data to help drive the BoF discussion of the proposal document.

Before starting the survey design I stated that we should first have the proposal document ready, and the survey should ask the relevant questions with respect to the proposal.

Also, this “variant” was discussed very early when the monorepo proposal came out.

13. If multirepo is adopted, how much pain will there be in your transition?
- Nothing consequential.
- A little; but it'll be fine.
- A lot; but it'll get done somehow.
- Too much; I/we may stop contributing to LLVM.

14. If monorepo is adopted, how much pain will there be in your transition?
- Nothing consequential.
- A little; but it'll be fine.
- A lot; but it'll get done somehow.
- Too much; I/we may stop contributing to LLVM.

Those are already covered by the current bad/good, but I'll change the
wording to be like this one.

15. If we could go back in time and restart the project with today's technologies, which repository scheme would be best for the LLVM project?
- CVS.
- Subversion repository with split sub-projects (<sub-project>/trunk), with git-svn.
- Subversion repository as a single project (trunk/<sub-project>), with git-svn.
- Git: multirepo variant.
- Git: monorepo variant.
- Git: multi/mono hybrid variant.
- Other.

Let's not put CVS in there, please. :slight_smile:

So, what's the purpose of this question? I mean, we are "starting
fresh" in a way, and the responses of the rest of the survey would
make this question irrelevant, no?

We’re not “starting fresh”: we have downstream users integrating the repos, we have a bug tracker referencing revisions, we have tooling (LNT, llvm-bisect …).
The sense of the question is: setting aside the pain of the transition, what is the “ideal” environment for developing LLVM?

This is a point of contention and a concern that Chris voiced about the monorepo. It should be in the survey.

A lot of concerns were voiced in the discussion, not all of them here.

Hasn't this particular point been solved by shallow checkouts?

Chris, are you still worried about disk size on a mono-repo vs. sub-modules?

The point of the survey is to gather data. The fact that not many people are doing it does not mean that, after reading the proposal document, they wouldn’t answer “It should be easy enough that everyone does it by default.”

We can go on and on about many topics, but the more we put in, the
harder it will be to make sense of things. Unless the question is
critical to the problem at hand, which I don't believe it is, we
should avoid bloating the survey.

As I said to Chris L. before, we can have a complete survey that will
take a lot of time to answer and will give us wonderful data over the
course of months, and we can have a quick survey to feed the BoF
discussion, but we can't have both.

7. Single-commit cross-project refactoring designs away a class of build failures and simplifies making API changes. How important is it?

I don’t see an “opinion” in the question.

Perhaps I should have said "a point of view".

Asking it this way does not allow someone to answer “It should be easy enough that everyone does it by default.”

I made a scale: must fix / could fix / doesn't matter.

We're not “endorsing” anything in the survey. We’re collecting data to help drive the BoF discussion of the proposal document.

The deal was to collect what's proposed only, and we're not (or should
not be) proposing a third alternative which won't have time to be
discussed.

Before starting the survey design I stated that we should first have the proposal document ready, and the survey should ask the relevant questions with respect to the proposal.

We had the first proposal agreed and documented one month before the
survey first appeared. The second proposal is still not ready and we
won't have time to do a third.

Also, this “variant” was discussed very early when the monorepo proposal came out.

Many variants were proposed for the sub-modules, too and only one
survived. Again, that was the "deal" we all reached in the review to
make the proposal and the survey more manageable.

We’re not “starting fresh”: we have downstream users integrating the repos, we have a bug tracker referencing revisions, we have tooling (LNT, llvm-bisect …).
The sense of the question is: setting aside the pain of the transition, what is the “ideal” environment for developing LLVM?

I see your point. I'll add this question.

cheers,
--renato

Hi,

Thanks a lot Duncan, I really like this! I totally support adopting this scheme now. See inline a few quite minor comments.

Renato: are you still interested and available now to set-up the survey? We should close on this *this week*.

Folks,

After feedback from Chris and Mehdi, I have added one long text answer
to *each* critical question (impact on productivity), so that people
can extend their reasoning.

But I have not made them compulsory, so that people who don't know
much about it or don't have any problem don't feel compelled to expand on
anything.

https://docs.google.com/forms/d/e/1FAIpQLSc2PBeHW-meULpCOpmbGK1yb2qX8yzcQBtT4nqNF05vSv69WA/viewform

The bottom line is, answers with more substance will be taken more
seriously than those without, but the numbers should also tell us
something.

For example, if 9/10 of the answers are "we don't care", then the
other 1/10 will have less weight than if it was 1/2, but still
important to any decision.

Can we go live with what we have now?

Mehdi, how's the document? Can we get that online so I can change the
header of the survey?

Hi Renato,

Thanks very much for putting this together.

I think the proposal document is almost finished now. Since I ended up reviewing it pretty thoroughly, I've gained a bit of understanding about the concerns we need input on.

The survey is a great start, but the final page isn't quite addressing the concerns in the final proposal. I'm not sure it asks the right questions to focus the conversation at the BoF.

Firstly, I'm concerned that the questions focus on:
- what's good for the individuals responding, instead of what they think is best for the LLVM project; and
- how much pain the transition would cause, instead of what they think the right final state is.

Secondly, I'm worried about this question: "How does the choice between a single repository with all projects and the use of sub-modules impact your usage of Git?" I'm not sure we'll get good signal from this; it's essentially a vote on the two variants, but it doesn't force the respondent to think about the specific issues. I'd rather find a way to ask about the specific concerns raised in the document.

Thirdly, I'm worried that the follow-ups talk about "preferred" and "non-preferred" instead of "multirepo" and "monorepo". This makes data-mining non-trivial (because the meaning depends on previous answers) and increases the chance of respondent confusion.
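
As a rough illustration of why this complicates data-mining, here is a minimal Python sketch of the extra step it forces: every "preferred" / "non-preferred" answer has to be joined against the respondent's earlier choice before it can be counted per variant. The field names and sample answers are hypothetical, not taken from the actual form.

```python
def pain_by_variant(responses):
    """Re-key 'preferred' / 'non-preferred' pain answers by variant name."""
    other = {"multirepo": "monorepo", "monorepo": "multirepo"}
    resolved = {"multirepo": [], "monorepo": []}
    for r in responses:
        preferred = r["preferred"]  # meaning of the follow-ups depends on this
        resolved[preferred].append(r["pain_preferred"])
        resolved[other[preferred]].append(r["pain_non_preferred"])
    return resolved


responses = [
    {"preferred": "monorepo", "pain_preferred": "a little",
     "pain_non_preferred": "a lot"},
    {"preferred": "multirepo", "pain_preferred": "nothing",
     "pain_non_preferred": "too much"},
]

print(pain_by_variant(responses))
```

If the follow-ups name "multirepo" and "monorepo" directly, this join disappears and each column can be tallied as is.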

I spent some time today thinking through what set of questions would get us the data we want.
- I've focused on the main concerns about (and benefits of) the two variants.
- I've referred to the multirepo and monorepo by name (consistently) in questions asking about them. This ensures that people know exactly what they're answering.
- I've added specific questions about how people plan to use the multirepo and monorepo, so that we know which tooling is most important (and also to determine how worried we should be about some of the concerns).
- I've moved the "vote"-like question to the end to force respondents to think through the issues first. I've also restricted "the vote" to "the ideal project setup", so we can clearly separate that from "transition pain". (I'm still not sure the vote will have much value, but it doesn't hurt.)

Here are my suggested questions; feedback welcome:

----

1. How do you use LLVM?
// as is

2. Which projects do you contribute to / use?
// as is

For this last question, I’d keep a check-box list with an optional blank field. Since the list of projects is limited, checking boxes is easier both for the people answering and for us doing the data-mining.
(What Renato has set-up looks good to me here)

I agree that what Renato has set up looks good. That's what I meant by "as is", but I wasn't clear.

3. Use this field to expand on your usage, if necessary
// as is

4. How often do you work on a small LLVM sub-project without using a checkout of LLVM itself?
- Always.
- Most of the time.
- Sometimes.
- Never.

5. Please categorize how you interact with upstream.
- I need read/write access, and I have limited disk space.
- I need read/write access, but a 1GB clone doesn't scare me.
- I only need read access.

6. How important is cross-project blame, grep, etc.?
- Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
- Extremely. It should be easy enough that everyone does it by default.
- Somewhat. I would use it if it were easy, but it's just nice to have.
- Not at all. Anyone who cares can write their own tooling.

7. Single-commit cross-project refactoring designs away a class of build failures and simplifies making API changes. How important is it?
- Vital. I already use SVN/monorepo/custom-tooling to accomplish this.
- Extremely. It should be easy enough that everyone does it by default.
- Somewhat. I would use it if it were easy, but it's just nice to have.
- Not at all. Anyone who cares can write their own tooling.

8. The multirepo variant provides a read-only umbrella repository to coordinate commits between the split sub-project repositories using Git submodules. Assuming multirepo gets adopted, how do you expect to use the umbrella?
// checkboxes:
+ Actively contribute tooling improvements to improve it.
+ Integrate it into our downstream fork.
+ Use it for upstream contributions.
+ Use it as the primary interface development environment.
+ Use it for bisection.

9. If multirepo is adopted, how do you plan to contribute to upstream?
- Using Git submodules.
- Using the Git repos directly.
- Using the SVN bridges.
- n/a: I don't contribute.

Can you clarify what “Using Git submodules” means?
Since the umbrella is read-only, it is not clear to me.
Removing this first answer and keeping the others makes sense to me on the other hand.

Day-to-day work:
- Are you checking out the Git umbrella, using git-submodules, and hacking on the submodules within it?
- Are you checking out the split repos directly?
- Are you checking out the SVN bridges directly?
- n/a

Does that help to clarify? How can we adjust the wording to make it clear?

10. The monorepo variant provides read/write access to sub-projects via an SVN bridge and git-svn. Contributors will have the option to continue using repositories split on project boundaries. Assuming monorepo gets adopted, how do you plan to contribute?
- I'll use the monorepo as soon as it's possible, even before it's canonical.
- I'll use the monorepo as soon as it's canonical.
- I'll transition to monorepo eventually.
- I'll use the SVN bridge on separated sub-projects forever.
- I'll use a Git mirror (and/or git-svn) on separated sub-projects forever.
- n/a: I don't contribute.

11. If monorepo is adopted, how do you plan to integrate it downstream?
- We already use monorepo.
- We'll switch to pulling from monorepo during the transition period.
- We'll switch to pulling from monorepo eventually.
- We'll integrate from the SVN bridge forever.
- We'll integrate from the split sub-project Git mirror forever.
- n/a: There is no downstream.

12. The multi/mono hybrid variant merges some sub-projects, but leaves runtimes in separate repositories using the umbrella to tie them together. Is this the best or worst of both worlds?
- This is great. Native cross-project refactoring, without penalizing runtime-only developers.
- Whatever. I'll deal with it.
- This is terrible. All the transition pain of monorepo, without the advantages.

13. If multirepo is adopted, how much pain will there be in your transition?
- Nothing consequential.
- A little; but it'll be fine.
- A lot; but it'll get done somehow.
- Too much; I/we may stop contributing to LLVM.

14. If monorepo is adopted, how much pain will there be in your transition?
- Nothing consequential.
- A little; but it'll be fine.
- A lot; but it'll get done somehow.
- Too much; I/we may stop contributing to LLVM.

For the two previous questions, I’d add an answer “n/a: I don’t contribute”.
(We keep read-only views in both cases; it is not clear to me how it affects non-contributors.)

SGTM.

Btw, I now split into multiple pages, to make it less daunting, so
I've added this question at the "usage questions" page, but phrased in
a more generic way as "cross-repo usage" (example, bisect, blame,
etc).

cheers,
--renato