Codeface: Open-Source Developer Study

Dear Open-Source Developers,

The University of Passau is currently studying the mechanisms that contribute to effective collaboration in open-source projects so that appropriate tools and techniques can be created to support the needs of open-source developers. To achieve this goal, we are evaluating the usefulness of software archives (e.g., mailing lists, version-control systems, and bug trackers) to quantitatively model social relationships and collaborative patterns in open-source projects. The culmination of this effort is an open-source project called Codeface, which is a framework and web front-end for analyzing social and technical aspects of software development. To learn more about Codeface, please visit http://siemens.github.io/codeface.

We are now recruiting open-source developers to participate in a short survey. The survey is composed of 4 questions and takes about *5 minutes* to complete. We ensure that all of your information will be kept confidential and will only be used for scientific research purposes. There is no commercial interest in the results. We are merely interested in learning about your collaborative experiences as an open-source developer.

To access the survey, please click on the link below.
http://rfhinf067.hs-regensburg.de:8080/login.html

We highly appreciate your efforts, and we sincerely hope that you will take the time to participate. Upon completion of the survey, you may include your e-mail address so that we can send you the anonymized survey results.

We would also like to express our gratitude for the support from Siemens Corporate Research and the University of Applied Sciences Regensburg.

Sincerely,

Mitchell Joblin

PhD Student
Department of Informatics and Mathematics
University of Passau

Hi,

I started doing the survey, but I stopped when it required manually listing the developers that I collaborated with and the nature of the collaboration (and it seems that participating in a review or discussion thread with that developer requires listing them).

I believe that this is an overly onerous requirement, since the number of developers that I would have to list would be very large and it would take me a long time to inventory the nature of the collaboration. For example, on just one page of my “sent” email I can see at least 10 developers that I would have to manually list and analyze the nature of the thread. I assume that most LLVM developers would be in a similar boat.

For this survey to be more realistic, I would recommend that you mine the mailing list archives (and maybe also SVN and bugzilla) to develop preliminary information, then use that to pre-populate the contents of question 2 (after the survey-taker has given their emails); you also may want to present a listing of thread titles with the ability to click through to show the mailing list thread for further inspection.
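As a rough illustration of the pre-population idea above, the sketch below groups messages by thread and lists the people who posted in the same threads as the survey-taker, together with the thread titles. It is only a sketch under stated assumptions: the message dictionaries, their keys, the `co_participants` helper, and the example addresses are all hypothetical, and a real archive would first have to be parsed into this shape (for instance with Python's `mailbox` module).

```python
from collections import defaultdict


def co_participants(messages, my_addresses):
    """Map each co-participant to the titles of threads shared with the survey-taker.

    `messages` is an iterable of dicts with the hypothetical keys
    'thread_id', 'sender', and 'subject'. `my_addresses` is the set of
    addresses the survey-taker has posted from.
    """
    mine = {addr.lower() for addr in my_addresses}
    senders = defaultdict(set)  # thread_id -> addresses that posted in it
    subjects = {}               # thread_id -> subject of the first message seen
    for msg in messages:
        tid = msg["thread_id"]
        senders[tid].add(msg["sender"].lower())
        subjects.setdefault(tid, msg["subject"])

    shared = defaultdict(list)
    for tid, people in senders.items():
        if people & mine:  # the survey-taker posted in this thread
            for other in sorted(people - mine):
                shared[other].append(subjects[tid])
    return dict(shared)


# Hypothetical example data; not taken from any real archive.
example_messages = [
    {"thread_id": "t1", "sender": "alice@example.org", "subject": "[RFC] New pass"},
    {"thread_id": "t1", "sender": "Bob@example.org", "subject": "Re: [RFC] New pass"},
    {"thread_id": "t2", "sender": "carol@example.org", "subject": "Build failure"},
    {"thread_id": "t2", "sender": "dave@example.org", "subject": "Re: Build failure"},
    {"thread_id": "t3", "sender": "alice@work.example.com", "subject": "Review: patch"},
    {"thread_id": "t3", "sender": "carol@example.org", "subject": "Re: Review: patch"},
]
example_result = co_participants(
    example_messages, {"alice@example.org", "alice@work.example.com"}
)
```

Accepting a set of addresses for the survey-taker, rather than a single one, is what lets threads from different accounts count toward the same person.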

Also keep in mind that a number of us have been (or still are) involved with LLVM from multiple email addresses, so the system must be able to take this into account. For example, a survey-taker should be able to specify multiple email addresses that are associated with them, and probably also should be able to say “these two rows of question 2 are actually the same person”.

Also, it’s not clear what version number we should put in; the page seems to suggest using e.g. v3.0 if you are a current LLVM developer, but that doesn’t make sense because we have been in v3.x for a very long time.
It doesn’t really matter though, because LLVM development does not revolve around releases and so using version numbers to identify anything about developer involvement doesn’t make sense. I would recommend just using the time of involvement, ideally pre-populated from mailing list information.

– Sean Silva

Hi Sean,

First, thanks for your helpful comments.

Quoting Sean Silva <chisophugis@gmail.com>:

> Hi,
>
> I started doing the survey, but I stopped when it required manually listing
> the developers that I collaborated with and the nature of the collaboration
> (and it seems that participating in a review or discussion thread with that
> developer requires listing them).

I agree; for highly active people, entering every name is simply too time-consuming. In this case, I suggest that you mention some of the key people that immediately come to mind and forget about trying to exhaustively list every person. We recognized that achieving high recall is unlikely, so we are happy to achieve high precision instead. Even the few names that you mention are very helpful to us. I will change the question text to reflect this point, but I hope we can still get a response from you even if it is not completely representative of all your collaborative relationships.

> I believe that this is an overly onerous requirement, since the number of
> developers that I would have to list would be very large and it would take
> me a long time to inventory the nature of the collaboration. For example,
> on just one page of my "sent" email I can see at least 10 developers that I
> would have to manually list and analyze the nature of the thread. I assume
> that most LLVM developers would be in a similar boat.
>
> For this survey to be more realistic, I would recommend that you mine the
> mailing list archives (and maybe also SVN and bugzilla) to develop
> preliminary information, then use that to pre-populate the contents of
> question 2 (after the survey-taker has given their emails); you also may
> want to present a listing of thread titles with the ability to click
> through to show the mailing list thread for further inspection.

Yes, this is an important point. We do already mine the version-control system and provide a drop-down menu, but there could be people missing from the list for a variety of technical reasons. Name aliasing is an issue: we try to resolve a single identifier for these multiple aliases, but this is naturally based on heuristics that do not always work correctly. We have also intentionally not attempted to provide a small list of developers (e.g., people you already exchanged an email with) to avoid introducing a bias. These things of course always involve a trade-off, as you have pointed out.
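To make the aliasing problem concrete, here is a minimal sketch of one possible heuristic: treat two (name, email) pairs as the same person if they share a normalized name or an email address, and merge them transitively. This is not the heuristic Codeface actually uses; the `merge_aliases` function and the example identities are hypothetical. It also shows why such heuristics can fail, since two different people with the same name would be merged incorrectly.

```python
def merge_aliases(identities):
    """Group (name, email) pairs that share a normalized name or an email address."""
    parent = list(range(len(identities)))

    def find(i):
        # Follow parent links to the group's representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}  # ('name', ...) or ('email', ...) -> index of first identity
    for idx, (name, email) in enumerate(identities):
        keys = [
            ("name", " ".join(name.lower().split())),  # case/whitespace-insensitive
            ("email", email.lower()),
        ]
        for key in keys:
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = {}
    for idx, ident in enumerate(identities):
        groups.setdefault(find(idx), []).append(ident)
    return list(groups.values())


# Hypothetical identities; the second links the first and third transitively.
example_identities = [
    ("Jane Doe", "jane@example.org"),
    ("jane  doe", "jdoe@work.example.com"),
    ("J. Doe", "jdoe@work.example.com"),
    ("John Smith", "john@example.org"),
]
example_groups = merge_aliases(example_identities)
```

The transitive merge matters here: "Jane Doe" and "J. Doe" share neither a name nor an address directly, but both are linked through the middle entry.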

Speaking of the mailing list, I am currently analyzing the LLVM mailing list. Anyone who is interested in seeing the networks built from this data should include their email address in the survey, and I can send them to you.

> Also keep in mind that a number of us have been (or still are) involved
> with LLVM from multiple email addresses, so the system must be able to take
> this into account. For example, a survey-taker should be able to specify
> multiple email addresses that are associated with them, and probably also
> should be able to say "these two rows of question 2 are actually the same
> person".
>
> Also, it's not clear what version number we should put in; the page seems
> to suggest using e.g. v3.0 if you are a current LLVM developer, but that
> doesn't make sense because we have been in v3.x for a very long time.
> It doesn't really matter though, because LLVM development does not revolve
> around releases and so using version numbers to identify anything about
> developer involvement doesn't make sense. I would recommend just using the
> time of involvement, ideally pre-populated from mailing list information.

It would have been sufficient to enter v3.x.x to whatever precision you like. We just need to limit the temporal period to avoid the error of trying to find collaborative relationships that are separated by 5 years or some other very large period. Perhaps for LLVM, using a version to refer to a time period does not work well; I originally expected that developers would think in terms of releases and not dates. I will add the option to enter a date on the login page to help with this problem.
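The date-based restriction could look roughly like the sketch below, which keeps only the interactions that fall inside a chosen period. The `within_window` helper, the record layout, and the example dates are hypothetical and only illustrate the idea of bounding the time range.

```python
from datetime import date


def within_window(interactions, start, end):
    """Keep interactions whose date falls in the inclusive range [start, end]."""
    return [item for item in interactions if start <= item["date"] <= end]


# Hypothetical records; each holds two collaborators and the interaction date.
example_interactions = [
    {"pair": ("alice", "bob"), "date": date(2009, 3, 1)},
    {"pair": ("alice", "carol"), "date": date(2013, 11, 20)},
    {"pair": ("bob", "carol"), "date": date(2014, 1, 5)},
]
recent = within_window(example_interactions, date(2013, 6, 1), date(2014, 6, 1))
```

With a window like this, the 2009 interaction is dropped, so relationships separated by several years are not compared against each other.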

Thanks again for the comments.

-- Mitchell Joblin