"Hidden emails" on GitHub: should we do something about it?

GitHub has a setting to “keep your email private”, which impacts operations done through the Web UI, including merging a pull-request.

Here is the impact in the git log:

$  git log  | grep users.noreply.github.com | sort -u 
    406f1e8211f8f5017f44f46af750dec061e707a2 by Zhang <5205699+Naville@users.noreply.github.com>:
Author: 5chmidti <44101708+5chmidti@users.noreply.github.com>
Author: Aaron Jarmusch <61516473+ajarmusch@users.noreply.github.com>
Author: Aart Bik <39774503+aartbik@users.noreply.github.com>
Author: Abhina Sree <69635948+abhina-sree@users.noreply.github.com>
Author: Abhinav271828 <71174780+Abhinav271828@users.noreply.github.com>
Author: Adam Straw <26910851+adstraw@users.noreply.github.com>
Author: AdityaK <1894981+hiraditya@users.noreply.github.com>
...
Author: Zain Jaffal <110604482+zjaffal@users.noreply.github.com>
Author: Zhang <5205699+Naville@users.noreply.github.com>
Author: zyn0217 <117258288+zyn0217@users.noreply.github.com>
$  git log  | grep users.noreply.github.com | sort -u  | wc  -l
238

The same users pushing commits directly to main wouldn’t see the email hidden.
When merging a PR for someone else, if I check it out locally and push the commit, it would likely also see the original authorship preserved.

So: is this OK or should we do something about it?
An easy step could be to have one of the action bot add a warning comment in the PR when the author has this setting enabled.

6 Likes

In my opinion (and seems to be shared by a significant number on Discord), it’s not, and we should. One notable point brought up was that it makes it hard to find contact addresses authors for re-licensing purposes, as has been done in the past (hopefully not again, but…).

3 Likes

I’ve been trying to think this through. So far, I think that:

  • It is important to be able to reach authors sometimes years later. Is a github user name good enough compared to an email address? With a github user name, we’re relying on github being around years from now. With an email address, we’re relying on the email address still being in use by the person years in the future. I think both are not perfect; but I wouldn’t say one is clearly better than the other.
  • I do think that it is important in some cases to be able to send a private/confidential message to an author in some circumstances. Email addresses obviously enable that. I’m not sure if it’s possible to send a confidential message to someone if you only know their github user name?

Overall, my feel is that anonymous email addresses might be OK if there is a possibility to send a confidential message to someone when you only know their github user name. If that is not possible, I think it would be best to enforce authors sharing a working email address in the commit log.

3 Likes

One of the downsides with hiding email addresses is that it is no longer obvious if a contribution is owned by a corporation rather than an individual acting in their own time. For example, it would no longer be obvious that my contributions are owned by Sony if my email address was hidden. This information was important for the relicensing efforts, but I can imagine there might be other cases where it is important too.

1 Like

Another issue with this approach is that the committers won’t be notified by the buildbots if a build fails since, as far as I can tell, these emails are not deliverable.

For this last issue, there was a discussion about failing buildbots publishing a comment in the PR (which has the advantage that reviewers would also be notified)

I have a lot of random things going on via GitHub, and so I prefer to not have my primary GitHub account email associated with everything I do by default. I am OK having my corporate email associated with my MLIR commits, but it would be lame to leave my job and lose the association of that work with GitHub account, which I would expect to last longer than my employment.

I thought it was sufficient to add a repository-local git config to associate a per-repo email address, so long as the locally set email is associated with your GH account in the GH settings. The official GH docs explain this at Setting your commit email address - GitHub Docs

However, in [mlir] add debug messages explaining failure reasons of isValidIntOrFloat by j2kun · Pull Request #69476 · llvm/llvm-project · GitHub the UI seems to claim that the authorship would have been overwritten when squash+merging. Indeed, I tested this on a different repository and the authorship info is rewritten. However, it does NOT rewrite author information when the merge is a rebase+merge. If the convention is to have one commit per PR, then a rebase+merge strategy might help.

1 Like

FYI you don’t have to remove a work email from GitHub when you leave, for example this commit from me 6 years ago with my @apple.com email is still associated with my GitHub user there (and that was done on SVN before the LLVM project moved to GitHub :wink: )

Also: my work email is not my “default” in my profile. But I don’t know if it’ll impact what would be used when I merge one of your PR? GitHub may be smart enough to use your commit email…

Seems like lot of requests for work/personal customization:

1 Like

I also posed this question to the GH devs: Is it possible to "expose your email" on a per-repository basis? · community · Discussion #70582 · GitHub

There is also Emails should have an individual "private" option · community · Discussion #58127 · GitHub for selecting which emails can be exposed if you have multiple.

I’m bumping this topic, this becomes more problematic.
Problems that I seen:

Scenario A: (Example: #71974)
Pull request contain commits from an real email and users.noreply.github.com email. During merge only users.noreply.github.com email is used. Information about original email is lost.

Maybe CI could detect this somehow and provide some warning to user.

Scenario B
Commits in pull requests are from email “ABC”, but user want to merge this as “XYZ”. In such case merge need to be done manually & information about state “merged” is lost in pull requests, such request is set as “closed”.

Scenario C
Email provided by user is invalid. Should such PR’s be somehow blocked ?

This issue could be problematic when some contributions supposed to be done from a corporate email, but instead merges are done as users.noreply.github.com.

Maybe CI should provide some warnings, and maybe users should be able to self-merge already approved PRs, so they could see an email and be able to select correct one, otherwise person that do squash and merge should be able to adjust author.

I know this was discussed a little bit on Discord a few weeks ago, but do we actually have a technical solution to prevent commits with these emails or do we need to implement something?

We would need to write a github action detecting this I think?

What do we need to look for? A PR author with a hidden email address?

I think the “hide email” feature of GitHub rewrites the email address when the PR is merged…

Isn’t there an API to query information about a PR author? Wouldn’t we get the anonymous email in this case?

This should do it:

import github
import sys

token = sys.argv[1]
pr_number = int(sys.argv[2])
gh = github.Github(login_or_token=token)
if not gh.get_organization('llvm').get_repo("llvm-project").get_pull(pr_number).user.email:
    print("Private Email")

Once we’ve detected this, what action should we take? I don’t think there is any way to block the merge. Should we just add a comment?

If we want that people stop using this feature, I would ask them to in a comment. Can were make it similar to clang-format failing (that is a “check fail” mark)?

One possible way to handle the issue is to require DCO (Developer Certificate of Origin). And there are even bots that can check and enforce this, e.g. https://probot.github.io/apps/dco/

They won’t enforce the commit author I believe, but instead some comment in the commit description saying “signed-off-by”. That may be an alternative to offer to folks who want their email in the description instead of the Git metadata, but I’m not sure we should just mandate that to everyone for solving this thread (if we want this for another reason, that’s another discussion of course)