Google's Plan for the LLVM Presubmit Infrastructure

Hey,

There’s been a fair bit of discussion around the LLVM Presubmit, and some recent changes around this inside Google, so it’s time for me to share our plans to get LLVM Presubmit to a sustainable and useful place where it can offer the most benefit to the community.

I’m going to use the generic term presubmit here to refer to the Google-provided build and test of LLVM on Windows and Linux. There are certainly other presubmit things happening within LLVM, but for the purposes of this post I’m narrowing the definition down to only this.

Current State and the Recent Past

When someone opens an LLVM PR on GitHub, GitHub sends a presubmit webhook to BuildKite, which invokes a build on our servers within Google Cloud.

Some of the configuration of this system is on GitHub, some is on BuildKite, some is in the separate llvm-premerge-checks repository, and some is in the .ci directory of the monorepo. The Google Cloud servers that ultimately do the work are in a Kubernetes cluster hosted on an internal Google account, which for security reasons doesn’t allow non-Googlers administrative access.

In terms of ownership, while we’ve had several volunteers in the past, I’m the current responsible individual from the Google side, and I’ve recruited a few other people from within Google to help out.

There are a number of ongoing issues with this infrastructure. It has been unreliable at times, and because the volunteers maintaining it from the Google side have been doing so in their spare time, they have sometimes been slow to respond to time-sensitive issues when other priorities at work held their attention. This has led to burnout on the Googlers’ side and frustration in the community. In particular, the Windows builds are taking far too long, leading to queuing delays and other problems in the system.

I have also recently learned, mostly because it is also facing issues, that we maintain some existing Buildbot infrastructure, and I am looking into those issues as well.

The Plan

My desired end state is to transition the presubmit system to be operated by the community, while Google continues to provide the (not cheap!) computing infrastructure to the community for the benefit of all. However, I have a few concerns about immediately switching the infrastructure to community maintenance:

  1. The system is not up to my standards of reliability, stability, and quality, and we need updated documentation.
  2. We’re spending Google’s money on this infrastructure. We want to keep funding this, and so we want to iterate on some aspects of the design quickly (and while the above standards and documentation may not yet be met) to ensure the resource usage is sustainable long-term.
  3. The governance model for LLVM is still up for decision. The Foundation is also actively working on Infrastructure plans, and I’d like fewer moving pieces on some of this until the boundaries and expectations are clearer.

Stabilization and Productionisation

Here are the issues that I’d like to address. I think, if it makes sense, I’ll make GitHub issues tracking each of these so interested parties can follow along there:

  • Change from BuildKite to GitHub Actions: started investigating. Note that there are a number of other services besides the Google-provided presubmit that are configured on BuildKite, but those aren’t germane to this plan.
  • Set up analytics, logging, alerting: started investigating. Right now, if there’s a problem with the servers, I rely on someone contacting me on Discourse or Discord. I also can’t speak to the average or peak queuing delay for the Windows builds, for example. Without this, we can’t objectively assess the quality of service in the system or respond to ongoing issues in a timely manner.
  • Get queuing delays under control: started investigating. In particular the Windows builds take a great deal of time to execute. I’m looking into a few solutions here but I need to finish getting conclusive results so I can share something concrete.
  • Potential consolidation: discussions ongoing. There are a number of other Google-hosted or Google-provided presubmit services, for example LNT or the libc premerge testing. Does it make sense for them to settle on a common, shared infrastructure? What would we need to change in our infrastructure in order to accommodate them? Should we provide a channel for official release builds? Is there a desire for the Docker images used in presubmit to become official images for release or self-hosted dev builds? Happy to discuss these sorts of things with folks as they come up.
  • Document existing system: blocked on stabilization. I anticipate some non-trivial changes to the current architecture and tooling due to the move to GitHub, the logging, and the queuing delays solution. Once those issues are resolved, I think we can document the administration of the system so that new contributors can onboard with less pain.
  • Move everything to the monorepo: not started. I think it makes sense to consolidate the configuration files and documentation files here, in anticipation of moving to the shared governance model.
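
To make the analytics item above concrete, here is a minimal sketch of how average and peak queuing delay could be computed once job timestamps are being collected. It assumes each job record is a (queued_at, started_at) pair; the function name and the sample data are hypothetical and do not reflect any existing presubmit tooling.

```python
from datetime import datetime, timedelta
from statistics import mean


def queue_delay_stats(jobs):
    """Return (average, peak) queuing delay for (queued_at, started_at) pairs."""
    # Queuing delay is the time a job waited before a worker picked it up.
    delays = [started - queued for queued, started in jobs]
    if not delays:
        raise ValueError("no jobs to summarize")
    average = timedelta(seconds=mean(d.total_seconds() for d in delays))
    return average, max(delays)


# Hypothetical sample data: three jobs queued at the same moment.
t0 = datetime(2024, 1, 1, 9, 0)
sample_jobs = [
    (t0, t0 + timedelta(minutes=5)),
    (t0, t0 + timedelta(minutes=45)),
    (t0, t0 + timedelta(minutes=10)),
]

average, peak = queue_delay_stats(sample_jobs)
print(f"average queuing delay: {average}, peak: {peak}")
```

Numbers like these, tracked per platform over time, are the kind of input an SLA discussion would need.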

Transition to Shared Governance

On-call maintenance of a service like presubmit for LLVM takes work, and some of that work will invariably be toil. I will do my best to minimize the toil during the stabilization period, but I think any plan for shared governance of infrastructure also needs to address the shared toil.

For equitable work sharing within Google, we typically establish an on-call rotation for system administration. I would advise that any party that wants to participate in the presubmit governance also join the rotation. The community would need to establish SLAs (requiring the analytics to be set up) and an escalation process. These are the kinds of decisions that I’m hopeful we can rely on LLVM governance and the Foundation to help us navigate, once the system is ready to hand over.

Next Steps

Work is already underway on the stabilization front. If there are other urgent issues requiring immediate attention please let me know. For now, I can be the point of contact for the Google-provided infrastructure. If you have feedback, questions, or ideas you’d like to share, please do get in touch.

Thanks!


Note that if this is a cost concern I know in the past that buildkite happily funded accounts for open source projects, and I imagine LLVM would qualify.

AFAICT there’s no cost concern from the Google side. Rather, with the move to GitHub, there’s been a lot of discussion of just using GitHub self-hosted builders directly, removing some of the complexity of this intermediate step. @Keenuts and @gchatelet had graciously volunteered to look into this from our side, perhaps they can add more.


Hey Folks,

We’ve been working away on the new LLVM Premerge Infrastructure, and are going to start Beta testing it soon. While we’ve been developing the system, it has only run on commits, not on PRs. The plan for the Beta test is to keep the current infrastructure in production while enabling PRs to also be tested on the new infrastructure. During beta testing, PRs will have a new GitHub Actions check labelled LLVM Premerge. Initially we will have these set to pass regardless of whether the build/test passes or fails, purely to test the infrastructure.
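
As an illustration of the "always pass" behaviour described above, here is a minimal sketch of a wrapper that runs a build/test command, logs its real result, and reports success regardless. This is not how the actual LLVM Premerge check is implemented; the function name and the demo command are hypothetical.

```python
import subprocess
import sys


def run_advisory(command):
    """Run `command`, log its real exit code, and report success regardless.

    Returns (reported_code, real_code); reported_code is always 0.
    """
    result = subprocess.run(command)
    status = "passed" if result.returncode == 0 else "FAILED"
    # The real outcome stays visible in the log even though the check is green.
    print(f"[advisory] real result: {status} (exit code {result.returncode})")
    return 0, result.returncode


# Hypothetical demo: a stand-in "build" that fails with exit code 3.
reported, real = run_advisory([sys.executable, "-c", "import sys; sys.exit(3)"])
print(f"reported exit code: {reported}, real exit code: {real}")
```

The job still takes the full time needed to build and test; only the reported status is forced to success.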

During Beta we plan to stabilize the infrastructure under the increased load, determine appropriate scaling, and gain confidence in its quality. Once we’re satisfied with the latency and reliability of the new system, we’ll plan a production “launch,” which will mean we’ll deprecate the legacy system and rely on the new one as the “authoritative” system (inasmuch as the legacy one is) going forward. At that point, we will allow the new checks to fail if the build/tests fail. If you notice a job failing when you do not expect it to, please file an issue on GitHub.

We’ve set up a public dashboard with analytics tracking the performance of the system here: Grafana

We’re building some documentation in the llvm-zorg repository. You can find it here: llvm-zorg/premerge/README.md at main · llvm/llvm-zorg · GitHub

Lastly, we’ve gathered a few other volunteers within Google and are establishing an on-call rotation with alerting, so you’re likely to see some new names from people working on the new infrastructure. For now we’re only testing the alerting and rotation mechanisms, so there’s no SLA offering, we don’t have complete timezone coverage, and have no coverage outside of business hours.

As always we welcome your feedback, feel free to reach out to @boomanaiden154-1, @Keenuts, or me directly.

Cheers!

Lucile


Thank you for continuing to work on this!

Thanks for all the work here!

(& re: the original post’s mention of SLAs and governance, speaking on behalf of the Infrastructure Area Team: I think we’re definitely interested in creating and monitoring SLAs for premerge testing in particular (as an opportunity to reset expectations compared to the postmerge testing that has a history of being noisy/slow/etc))

Does this mean, for now, the action will run and take the real time required to execute - but then always succeed? (so if this premerge testing is slow, we’ll still observe that slowness in the status on the PR?) Can we ensure that the “This is going to pass anyway” aspect is communicated through the UI, (perhaps by naming the action in a way to communicate that it’s always going to be green and/or can be ignored?) so people aren’t waiting around for a result that’s always green anyway?

Yes. The actions will actually run like they normally do, but will always report success.

The only way to do that would be to adjust the job naming. There’s not really another way in the GitHub UI to mark it as testing/will always succeed. We can rename the jobs, though, if people feel that would more clearly communicate expectations.

That’d be my vote, but don’t feel too strongly about it.

Renaming might help ensure we don’t burn more confidence in premerge checks by giving people another reason to lose confidence in them (thinking they are meaningful/waiting for them only to find they’re still experimental/always passing/maybe slow/etc)