Subprojects, GitHub, and the Monorepo

I work on clangd, the language server/IDE backend in clang/tools/extra.

Clangd is at a stage where the core functionality is stable and useful enough that we want to put it in front of more users. I’ve been spending time recently thinking about user-facing things: packaging, mailing lists, docs, bugtracking.

And I think we should do much of this on GitHub, rather than *.llvm.org. And not in the upcoming monorepo, but in a separate repository. (e.g. github.com/llvm/clang)

I expect this to be controversial. It’s definitely community fragmentation. I think the reasons to do it for clangd are strong, but they won’t apply equally to all projects. And I’d like to know what people think. So here’s my reasoning.

Point 1: It’s what people expect.
Everyone knows how to use the Github bug tracker, and has a Github account.
Everyone knows markdown, how to edit-and-preview, and how to send a doc pull request.
Everyone has these workflows in their muscle memory when a github project is the top websearch result.
(Current LLVM developers also know the LLVM equivalents, but that’s a small group).
This is largely why we’re moving the code to Github, too.

Point 2: exposing the LLVM monolith is bad for users.
Clangd’s customers don’t care about the structure of the LLVM umbrella project, or even that it exists.
If they search for clangd on the web, they want to find a tree that looks like this:
clangd

  • features
  • installation
  • bugs
  • code

Not like this:

llvm.org

  • docs
    – lldb, etc
    – clang
      – features, etc
      – tools
        – clang-tidy, etc
        – clangd
          – features
          – installation
  • bugs
    – lldb, etc
    – clang
      – tools
        – clang-tidy, etc
        – clangd
  • code
    – lldb, etc

LLVM’s source repository is monolithic for technical reasons (versioning), but that’s not a strong reason that the bug trackers, documentation etc should be monolithic. Spraying hyperlinks around won’t fix the fact that the website is the wrong shape.

Point 3: the tools are just better.
I have nothing but respect and gratitude for the people that admin bugzilla, wrangle CMake and sphinx to generate docs, and keep mailman running. But unsurprisingly the state of the art has moved on, and the equivalents are in my experience easier to use, faster, and more reliable.
Symptoms of this are people routing around the tools: LLDB doesn’t use sphinx for docs, sanitizers don’t use bugzilla.
I’m sure there’s going to be some agreement and disagreement on this point :)

Point 4: but the tools are designed for smaller, focused repositories
The “github-native” community is mostly using fairly narrowly scoped repositories, and the tools work better this way. For example, labels are enough to organize issues in a project the size of clangd, but too lightweight if the scope is LLVM and all subprojects.

What does the logical conclusion of this look like?
I don’t know. I suspect other subprojects in a similar boat may independently come to the same conclusion. Projects that have e.g. lots of bug history will need a migration story.
None of this mitigates the need for a source monorepo, so we’d be stuck with all the code in llvm/llvm and just issues/docs in llvm/clangd. Not ideal, but manageable.
Clangd is a pretty easy case, so I don’t know if this makes it a good trial or a bad one.

<dons flame-retardant suit>
What do you all think?

I think these issues you raise are all basically fixable, without fragmenting the community.

Bugtracker

Yes, everyone has a github account – but I think everyone could also use bugzilla easily enough, if we add a “Login with github” button. I would like someone to volunteer to work on getting bugzilla patched with that functionality. (Someone, please volunteer!)

Website, and raw HTML vs markdown.

Clangd doesn’t have a website now, in the same sense as clang, llvm and other llvm projects do. It just has a little bit of sphinx documentation hidden deep within clang’s docs. I think this is a large part of the frustration you raise, but it’s not a necessary part of living under the llvm project umbrella. I think you probably want to make clangd have its own actual landing page.

As for the authoring format, IMO it’s definitely a good idea to start migrating to using markdown for the website, and migrate to github pages autogeneration and hosting. (But I think that should be done project-wide, not one-off.)

Exposing LLVM community identity

Clangd is part of the LLVM community, and having a community identity can be a good thing. But, indeed, the projects should also have their own identities within that.

You’ve expressed “Users don’t want to know that clangd is part of the llvm community”. But I think that’s exactly wrong. We should be exposing that fact and pushing that identity, and I don’t think that, by itself, bothers any users.

However, I think I see a different underlying issue hiding in that stated concern: finding clangd in the current website is too confusing. Having clangd be part of the LLVM community – or part of the LLVM website – isn’t the problem. Having clangd not be easily accessible on the LLVM website is a problem.

It could have higher visibility, even while being “part of the LLVM project”, and represented and structured as such.

Path forward

Instead of starting up a parallel infrastructure, I’d suggest that the way forward should be:

  1. Keep (and make minor enhancements to) bugzilla. Add a clangd component to bugzilla.
  2. Add a more user-facing, top-level, clangd landing page within the llvm website, like we have for other projects, to give the project higher visibility. With links to code, bugtracker (e.g. link directly to the search/enter-bug urls in the clangd component), as appropriate.
  3. Start converting the LLVM website to a github pages site.

Note that github pages allows a mixture of html, markdown, and jekyll site generator features, and automatically updates upon commits to the repository. I think a fairly extensive but still relatively simple example of a bunch of functionality is www.mono-project.org, autogenerated from <https://github.com/mono/website>.

(I don’t think the restructured-text “docs” directories necessarily need be migrated; for the most part those are somewhat of a different kind of thing than the “website-proper”, and could be left as is, at least for now.)

Website migration to git/github-pages does not need to wait for the entire svn->git repository migration process to finish! We could pretty easily decide to make the “www” git repository canonical first, before the code.

I work on clangd, the language server/IDE backend in clang/tools/extra.

Clangd is at a stage where the core functionality is stable and useful enough that we want to put it in front of more users. I’ve been spending time recently thinking about user-facing things: packaging, mailing lists, docs, bugtracking.

And I think we should do much of this on GitHub, rather than *.llvm.org. And not in the upcoming monorepo, but in a separate repository. (e.g. github.com/llvm/clang)

Can’t we / shouldn’t we create decomposed repos by partitioning commits made to the monorepo into corresponding ones made to per-project repo “mirrors”? Certainly if we can, we should. Seems to me that even monorepo commits spanning projects could be broken down and applied to independent repos. Perhaps it’s slightly more interesting when top-level content starts to be more common, but an “llvm-top” or “llvm-general” individual repo seems feasible. Even if they’re not the canonical repos for submitting changes, it’s still a valuable idea.
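
A minimal sketch of what such partitioning could look like, assuming each project lives in its own top-level directory of the monorepo and that top-level files are routed to the "llvm-top" mirror floated above. The function name and the sample paths are hypothetical, and a real tool would also need to carry over commit metadata and history:

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed name of the mirror that receives files at the monorepo's top level.
TOP_LEVEL_REPO = "llvm-top"


def partition_commit(changed_paths):
    """Group the paths touched by one monorepo commit by per-project mirror.

    Returns a dict mapping mirror name -> list of paths relative to that mirror.
    """
    by_project = defaultdict(list)
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if len(parts) > 1:
            # First path component is the project directory, e.g. "clang".
            project = parts[0]
            relative = str(PurePosixPath(*parts[1:]))
        else:
            # A file at the top level of the monorepo.
            project = TOP_LEVEL_REPO
            relative = path
        by_project[project].append(relative)
    return dict(by_project)


if __name__ == "__main__":
    # Illustrative commit that spans two projects plus a top-level file.
    commit = [
        "llvm/lib/IR/Value.cpp",
        "clang/lib/Sema/Sema.cpp",
        "clang/docs/index.rst",
        "README.md",
    ]
    for project, paths in partition_commit(commit).items():
        print(project, paths)
```

A commit that spans projects would then become one commit per mirror, each carrying only that project's paths.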

I expect this to be controversial. It’s definitely community fragmentation. I think the reasons to do it for clangd are strong, but they won’t apply equally to all projects. And I’d like to know what people think. So here’s my reasoning.

Point 1: It’s what people expect.
Everyone knows how to use the Github bug tracker, and has a Github account.
Everyone knows markdown, how to edit-and-preview, and how to send a doc pull request.
Everyone has these workflows in their muscle memory when a github project is the top websearch result.
(Current LLVM developers also know the LLVM equivalents, but that’s a small group).
This is largely why we’re moving the code to Github, too.

Point 2: exposing the LLVM monolith is bad for users.
Clangd’s customers don’t care about the structure of the LLVM umbrella project, or even that it exists.
If they search for clangd on the web, they want to find a tree that looks like this:

Not sure which “users” we’re referring to but if we’re talking about ones who use only binaries provided by llvm project, they wouldn’t necessarily have to know or care anything about git, svn, github, development trees, or anything else.

LLVM’s source repository is monolithic for technical reasons (versioning), but that’s not a strong reason that the bug trackers, documentation etc should be monolithic. Spraying hyperlinks around won’t fix the fact that the website is the wrong shape.

The monorepo is flat, yes? Is this satisfactory, or did I misunderstand your point? Is this about the tree or the organization of the website (or development process)? The website design could be very much orthogonal from the source control hierarchy. http://clang.llvm.org/ shows a page describing the compiler. Should http://clangd.llvm.org/ be created to describe clangd? Would that suffice?

What does the logical conclusion of this look like?

I don’t know. I suspect other subprojects in a similar boat may independently come to the same conclusion. Projects that have e.g. lots of bug history will need a migration story.
None of this mitigates the need for a source monorepo, so we’d be stuck with all the code in llvm/llvm and just issues/docs in llvm/clangd. Not ideal, but manageable.
Clangd is a pretty easy case, so I don’t know if this makes it a good trial or a bad one.

I think it would’ve been nice if instead we could have gone with llvm-repo-composed-of-submodules, but it was proposed and discussed, and several good reasons were given for why it wasn’t preferred. *shrug* Better a consensus on the git monorepo than years more of svn, IMO.

And I think we should do much of this on GitHub, rather than *.llvm.org. And not in the upcoming monorepo, but in a separate repository. (e.g. github.com/llvm/clang)
Of course, I meant github.com/llvm/clangd here.

I think these issues you raise are all basically fixable, without fragmenting the community.

Bugtracker

Yes, everyone has a github account – but I think everyone could also use bugzilla easily enough, if we add a “Login with github” button. I would like someone to volunteer to work on getting bugzilla patched with that functionality. (Someone, please volunteer!)

This is a necessary improvement, but not a sufficient one:

  • our users still won’t know how to use bugzilla
  • bolted-on github account support is second-class (e.g. cc lists are still email addresses, @mentions don’t work)
  • bugzilla’s UI is atrocious (hard to summarize; happy to go into this if there’s real disagreement)
  • it’s not feasible to run an instance per subproject, so e.g. subprojects have no ability to define their own labels/keywords

Website, and raw HTML vs markdown.

Clangd doesn’t have a website now, in the same sense as clang, llvm and other llvm projects do. It just has a little bit of sphinx documentation hidden deep within clang’s docs. I think this is a large part of the frustration you raise, but it’s not a necessary part of living under the llvm project umbrella. I think you probably want to make clangd have its own actual landing page.

As for the authoring format, IMO it’s definitely a good idea to start migrating to using markdown for the website, and migrate to github pages autogeneration and hosting. (But I think that should be done project-wide, not one-off.)

This all sounds right. A separate landing page is the right thing. Having spent some time attempting this, I would certainly like to avoid sphinx (including its markdown plugins).

Can you elaborate on why this should be done all in one go? Currently the various top-level sites are basically islands, and seem well-suited for piecemeal migration.
And for projects adding new documentation, forcing a choice between investing in a dead-end and migrating the world feels… unpleasant.

Exposing LLVM community identity

Clangd is part of the LLVM community, and having a community identity can be a good thing. But, indeed, the projects should also have their own identities within that.

You’ve expressed “Users don’t want to know that clangd is part of the llvm community”. But I think that’s exactly wrong. We should be exposing that fact and pushing that identity, and I don’t think that, by itself, bothers any users.

Certainly the website needs to say “clangd is built on Clang, and is part of the LLVM project”.

But there shouldn’t be any natural path from clangd landing page to “all LLVM bugs” (other than explicitly clangd → llvm → bugs).
That does indeed bother users.

If you look at lldb.llvm.org, the “bug reports” link links to LLVM bugzilla. Same for LLD.
clang-analyzer links to enter_bug.cgi?product=clang. The natural path to seeing all bugs is clicking “home” or “browse”, which takes you to all LLVM bugs.
These are not coincidences; this is the fundamental navigation structure, and patching it only makes it more confusing.

However, I think I see a different underlying issue hiding in that stated concern: finding clangd in the current website is too confusing. Having clangd be part of the LLVM community – or part of the LLVM website – isn’t the problem. Having clangd not be easily accessible on the LLVM website is a problem.

I disagree here. Our audience isn’t people browsing around llvm.org, it’s people typing “clangd” into google, and the (limited) docs are the top result.
The problem is getting lost once you’re there.

Path forward

As described above, I don’t think the path described actually addresses the problems that clangd has.
It seems like a reasonable direction for parts of the project that are well-served by the current structure, though.

Brian Cain wrote:

Can’t we / shouldn’t we create decomposed repos by partitioning commits made to the monorepo into corresponding ones made to per-project repo “mirrors”?

Not sure which “users” we’re referring to but if we’re talking about ones who use only binaries provided by llvm project, they wouldn’t necessarily have to know or care anything about git, svn, github, development trees, or anything else.

This thread is about the website, documentation, bugtrackers etc. Binary-only users care about those.
I think the source-layout topics have been pretty well covered elsewhere.

I think these issues you raise are all basically fixable, without fragmenting the community.

Bugtracker

Yes, everyone has a github account – but I think everyone could also use bugzilla easily enough, if we add a “Login with github” button. I would like someone to volunteer to work on getting bugzilla patched with that functionality. (Someone, please volunteer!)

This is a necessary improvement, but not a sufficient one:

  • our users still won’t know how to use bugzilla
  • bolted-on github account support is second-class (e.g. cc lists are still email addresses, @mentions don’t work)
  • bugzilla’s UI is atrocious (hard to summarize; happy to go into this if there’s real disagreement)
  • it’s not feasible to run an instance per subproject, so e.g. subprojects have no ability to define their own labels/keywords

also:

(Current LLVM developers also know the LLVM equivalents, but that’s a small group).

I don’t think anyone likes or wants to use or administer Bugzilla. It’s just what we have right now and we don’t as a community have bandwidth to figure out what to do until we’ve made the repository move. When we have bandwidth, we should just fix this across the project. Probably after we move the repo; possibly with GitHub issues, but I think it should be the same across the project.

I can see how having a completely separate bug tracker could help a little, but I’m skeptical there are major benefits over putting each tool that has a distinct userbase at the top-level.

  • clang
  • clangd
  • (whatever else)
  • llvm

Yes, everyone has a github account -- but I think everyone could also use bugzilla easily enough, if we add a "Login with github" button. I would like someone to volunteer to work on getting bugzilla patched with that functionality. (Someone, please volunteer!)

I'd only put in the effort if we aim to keep using it for a long time
after the GitHub migration. I don't mind bugzilla, github issues, or
any other, but migrations are always messy and we're already doing a
big one.

Adding login with GitHub won't stop spam and will bring its own
problems: bad integration, trouble managing previous emails from older
posts for the same person, etc.

- our users still won't know how to use bugzilla

Does anyone?

- bolted-on github account support is second-class (e.g. cc lists are still email addresses, @mentions don't work)

Yup.

- bugzilla's UI is atrocious (hard to summarize; happy to go into this if there's real disagreement)

Bugzilla is 20 years old. It was done before the dot-com crash when
every click was a new page, with all the context issues from the
times, and it's still basically the same.

For the same 20 years, I have used numerous tracking systems and all
of them are bad in some way. I'm sure if/when we get to use GitHub's
we'll realise that tags just don't scale, or that we can't separate
the projects correctly, or that we would have done something silly in
the beginning and not be able to move later on.

Rushing moving away from bugzilla is not a wise decision, IMHO.

I don't think anyone likes or wants to use or administer Bugzilla. It's just what we have right now and we don't as a community have bandwidth to figure out what to do until we've made the repository move. When we have bandwidth, we should just fix this across the project. Probably after we move the repo; possibly with GitHub issues, but I think it should be the same across the project.

Agreed.

I can see how having a completely separate bug tracker could help a little, but I'm skeptical there are major benefits over putting each tool that has a distinct userbase at the top-level.
- clang
- clangd
- (whatever else)
- llvm

For starters, having it all separate would make it harder to move bugs
around when we realise it's not a clang bug but a back-end one, or a
library issue.

My personal opinion is that we should not assume GitHub issues is the
way to go just because we're moving to GitHub for code. GitHub is
portable enough, and mainstream enough, that there are a vast number
of products out there that plug into it seamlessly.

We can even find that we don't need one single tool, but a number of
smaller ones, different for bugs, releases, major features, projects,
administration, etc. There is no technical reason for us to bundle it
all in one single tool.

That's what we've done with bugzilla (with sub-projects and meta bugs
etc) because we had no other option. I'd hate to rush a bug-tracking
move into another end-point and then have to have the same
conversation 5 years from now.

We have lived with bugzilla for the past decade, we can live with it
for another year.

cheers,
--renato

of them are bad in some way. I'm sure if/when we get to use GitHub's
we'll realise that tags just don't scale, or that we can't separate
the projects correctly, or that we would have done something silly in
the beginning and not be able to move later on.

Rushing moving away from bugzilla is not a wise decision, IMHO.

+1. I did a pilot migration a couple of years ago. Just for the
record: I successfully migrated (after some tweaking) the majority of
our bugzilla to both YouTrack and GitHub issues. So, this is
feasible. I would just suggest that we move one step at a time.

I could provide a preview version of YouTrack reasonably fast if anyone
is interested. GitHub would require quite some time though.

Fundamentally, I think there’s an underlying question here:
Have the projects under LLVM grown diverse enough that one solution doesn’t fit all any more, at least for docs & bug tracking?

The reason clangd lives in LLVM is a technical one; the user base in terms of who interacts with the community is very different; we want clangd bugs to be reported by folks who would never file a compiler bug (because from their point of view, it’s always a bug in your own code ;). Bug reports for clangd are by their nature less “heavy”, because clangd is not a precise enforcer of a standard, but a little helper to the programmer who is judged by providing value, not whether it is right (according to some standard). For example, “these 2 completions are in the wrong order” is a useful and valid bug report.

I’d agree that we can take it slowly if this was just about making our own lives easier. This is about the lives of our users, though, where I’d argue that making it more frustrating for them is ultimately going to make it hard to deliver a compelling product.

Cheers,
/Manuel

Have the projects under LLVM grown diverse enough that one solution doesn't fit all any more, at least for docs & bug tracking?

That's a good question, and one that, amidst the constant pull to
look at it recently, I have been thinking about, too.

The reason clangd lives in LLVM is a technical one; the user base in terms of who interacts with the community is very different;

I imagine this is the same reason to keep llgo inside the tree. How
good the technical side of it is is less important when interacting
with the sub-community.

we want clangd bugs to be reported by folks who would never file a compiler bug (because from their point of view, it's always a bug in your own code ;).

I think that's true for all projects. I don't think we should ever
start from the assumption that the user is wrong. It could be problems
in clarity, for ex. documentation, expectation, legacy, which is
different from being plain wrong.

Bug reports for clangd are by their nature less "heavy", because clangd is not a precise enforcer of a standard, but a little helper to the programmer who is judged by providing value, not whether it is right (according to some standard). For example, "these 2 completions are in the wrong order" is a useful and valid bug report.

I'd agree that we can take it slowly if this was just about making our own lives easier. This is about the lives of our users, though, where I'd argue that making it more frustrating for them is ultimately going to make it hard to deliver a compelling product.

I agree.

Some people consider open source software as "no one's land" where the
users have to fight to get support. I think that's 20 years too old.

But I also agree that not everyone knows how to be supportive (I often
struggle with it), and not responding makes users as angry as
responding badly.

Some users also react badly when shown to be wrong (standards, maths,
expectation, legacy, or just plain projects' own decisions), and that
puts off developers from helping again.

I am, however, quite happy to let those who can do it in the best way
possible, even if that means I'll have to click one or two more links
on my side. :)

If that means more than one tracking system (one per group of
projects, multi-tiered, etc), so be it, as long as there are people
willing to maintain it and doing a good job at it.

Brian Cain via llvm-dev <llvm-dev@lists.llvm.org> writes:

Can't we / shouldn't we create decomposed repos by partitioning
commits made to the monorepo into corresponding ones made to
per-project repo "mirrors"? Certainly if we can, we should. Seems to
me that even monorepo commits spanning projects could be broken down
and applied to independent repos. Perhaps it's slightly more
interesting when top-level content starts to be more common, but an
"llvm-top" or "llvm-general" individual repo seems feasible. Even if
they're not the canonical repos for submitting changes, it's still a
valuable idea.

In a vacuum, that might be preferable. But we have existing per-project
git mirrors. There are two ways to partition monorepo commits and both
are bad:

- Create new per-project git repos and partition monorepo commits to
  them

- Erase the existing per-project git mirrors and create new ones in
  their place, partitioning monorepo commits to them

The first creates multiple per-project git repositories and will cause
confusion.

The second effectively rewrites history for downstream users of the
per-project repositories.

The current proposal as I understand it recommends sparse checkouts to
work with something less than the full monorepo. I don't have any
experience with sparse checkouts and we're unlikely to use it so I can't
really comment on how useful it is, except it will prevent the two
scenarios above and therefore seems good to me.

Of course, that's assuming that SVN commits continue to be mirrored to
the per-project repositories (as well as the monorepo) for the time
being. If that's an incorrect assumption I'd like to know now.

                          -David

I haven’t seen a clear description of who clangd users are. The argument seems to be premised on “clangd users are active contributors to some other GitHub project and therefore want/expect a familiar experience for interacting with clangd providers.” Is that actually your target user base?

There are certainly large non-GitHub-based open-source projects out there in the world. It’s your prerogative to hand-wave them away, but you want to understand that you are in fact doing that.

–paulr

The current proposal as I understand it recommends sparse checkouts to
work with something less than the full monorepo. I don't have any
experience with sparse checkouts and we're unlikely to use it so I can't
really comment on how useful it is, except it will prevent the two
scenarios above and therefore seems good to me.

Well, sparse checkout affects only checkout. It does not affect clone
size, directory layout, history, etc. So, this means that downstream
users will still have to merge the whole monorepo somehow and tweak to
their needs (e.g. via manual history editing or force removal of
unnecessary projects). In my opinion this will create lots of problems
for downstream users unless we provide them with a working solution
(i.e. scripts) to essentially replicate the current workflow with the
new monorepo.

I haven’t seen a clear description of who clangd users are.

Good question!
The target audience is all C++ developers using editors where external IDE features make sense. (vim, emacs, vscode, sublime… not visual studio or notepad).

The argument seems to be premised on “clangd users are active contributors to some other GitHub project and therefore want/expect a familiar experience for interacting with clangd providers.” Is that actually your target user base?

Replace “active contributors to” with “users of”, and that’s a pretty reasonable estimate in the near-term.
(I do hope we eventually reach some users who have never ventured beyond “apt-get install”, but that’s further out)

There are certainly large non-GitHub-based open-source projects out there in the world. It’s your prerogative to hand-wave them away, but you want to understand that you are in fact doing that.

–paulr

The assumption is not that the user’s primary work is done on Github, but that they’ve interacted with some project that is hosted there.
I’m sure that’s not everyone, but it’s an awful lot of people.

Cheers, Sam

Thanks! Being a “user” of GitHub projects does not require having a GitHub account (having downloaded some myself, and not having a GitHub account whose name or associated email address I can remember). But if it’s commonplace to require a GitHub account in order to file a bug against any of these (I don’t know, I’ve never tried), then at least the proposal is following common practice and not something unusual.

FTR, I would dispute pretty much all of your original Point 1, as well as the later statement that “this is why we’re moving the code to GitHub” but that’s all not really relevant to the TL;DR of your proposal, which I might summarize this way:

  • clangd is (about to be) hosted and available on GitHub, so

  • it ought to present itself in ways that GitHub projects usually do.

The move-to-GitHub discussion is explicitly NOT addressing trying to move the overall bug-tracking system; but you might have a valid reason for popping up a separate one for clangd, and now I understand what you’re driving at (IIUC, please correct as needed).

Thanks,

–paulr

There are already a lot of responses and I haven’t read them all.

I just wanted to say that, if there is consensus among the clangd developers that this is what they want to do, they should feel empowered to do it. You are the ones best positioned to make the right decision for your project.