Hey,
There’s been a fair bit of discussion around the LLVM Presubmit, along with some recent changes around it inside Google, so it’s time for me to share our plans to get LLVM Presubmit to a sustainable and useful place where it can offer the most benefit to the community.
I’m going to use the generic term presubmit here to refer to the Google-provided build and test of LLVM on Windows and Linux. There are certainly other presubmit things happening within LLVM, but for the purposes of this post I’m narrowing the definition down to only this.
Current State and the Recent Past
When someone opens an LLVM PR on GitHub, GitHub sends a presubmit webhook to BuildKite, which invokes a build on our servers within Google Cloud.
Some of the configuration of this system is on GitHub, some is on BuildKite, some is in the separate llvm-premerge-checks repository, and some is in the .ci directory of the monorepo. The Google Cloud servers that ultimately do the work are in a Kubernetes cluster hosted on an internal Google account, which for security reasons doesn’t allow non-Googlers administrative access.
In terms of ownership, while we’ve had several volunteers in the past, I’m the current responsible individual from the Google side, and I’ve recruited a few other people from within Google to help out.
There are a number of ongoing issues with this infrastructure. It has been unreliable at times. The volunteers maintaining it from the Google side have been doing so in their spare time, so they have sometimes been slow to respond to time-sensitive issues while other priorities at work held their attention. This has led to burnout on the Googlers’ side and frustration in the community. In particular, the Windows builds are taking far too long, leading to queuing delays and other problems in the system.
I have also recently learned, mostly because it is also facing issues, that there is some existing Buildbot infrastructure that we are maintaining, and I am looking into that as well.
The Plan
My desired end state is to transition the presubmit system to be operated by the community, while Google continues to provide the (not cheap!) computing infrastructure to the community for the benefit of all. However, I have a few concerns about immediately switching the infrastructure to community maintenance:
- The system is not up to my standards of reliability, stability, and quality, and we need updated documentation.
- We’re spending Google’s money on this infrastructure. We want to keep funding this, so we want to iterate quickly on some aspects of the design, even before the standards and documentation above are in place, to ensure the resource usage is sustainable long-term.
- The governance model for LLVM is still up for decision. The Foundation is also actively working on infrastructure plans, and I’d like fewer moving pieces on some of this until the boundaries and expectations are clearer.
Stabilization and Productionisation
Here are the issues that I’d like to address. I think, if it makes sense, I’ll make GitHub issues tracking each of these so interested parties can follow along there:
- Change from BuildKite to GitHub Actions: started investigating. Note that there are a number of other services besides the Google-provided presubmit that are configured on BuildKite, but those aren’t germane to this plan.
- Set up analytics, logging, alerting: started investigating. Right now, if there’s a problem with the servers, I rely on someone contacting me on Discourse or Discord. I also can’t speak to the average or peak queuing delay for the Windows builds, for example. Without this, we can’t objectively assess the quality of service in the system or respond promptly to ongoing issues.
- Get queuing delays under control: started investigating. In particular the Windows builds take a great deal of time to execute. I’m looking into a few solutions here but I need to finish getting conclusive results so I can share something concrete.
- Potential consolidation: discussions ongoing. There are a number of other Google-hosted or Google-provided presubmit services, for example LNT or the libc premerge testing. Does it make sense for them to settle on a common, shared infrastructure? What would we need to change in our infrastructure in order to accommodate them? Should we provide a channel for official release builds? Is there a desire for the Docker images used in presubmit to become official images for release or self-hosted dev builds? Happy to discuss these sorts of things with folks as they come up.
- Document existing system: blocked on stabilization. I anticipate some non-trivial changes to the current architecture and tooling due to the move to GitHub, the logging, and the queuing delays solution. Once those issues are resolved, I think we can document the administration of the system so that new contributors can onboard with less pain.
- Move everything to the monorepo: not started. I think it makes sense to consolidate the configuration files and documentation files here, in anticipation of moving to the shared governance model.
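To make the analytics item above a little more concrete, here is a minimal sketch of the kind of number I’d like to be able to report: the average and peak queuing delay per platform. This is not the actual tooling. The `Job` record, its field names, and the `queue_delay_stats` helper are all hypothetical, and it assumes we can export a queued and a started timestamp for each build from whatever CI system we end up on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Job:
    """Hypothetical record of one presubmit build (field names are assumptions)."""
    platform: str          # e.g. "windows" or "linux"
    queued_at: datetime    # when the build was requested
    started_at: datetime   # when a worker actually picked it up


def queue_delay_stats(jobs, platform):
    """Return count, mean, and peak queuing delay in seconds for one platform.

    Returns None when there are no builds for that platform.
    """
    delays = [
        (job.started_at - job.queued_at).total_seconds()
        for job in jobs
        if job.platform == platform
    ]
    if not delays:
        return None
    return {"count": len(delays), "mean_s": mean(delays), "peak_s": max(delays)}


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    sample = [
        Job("windows", t0, t0 + timedelta(minutes=5)),
        Job("windows", t0, t0 + timedelta(minutes=45)),
        Job("linux", t0, t0 + timedelta(minutes=2)),
    ]
    print(queue_delay_stats(sample, "windows"))
```

Numbers like these, tracked over time, are what would let us set and check an SLA rather than relying on someone pinging me when things feel slow.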
Transition to Shared Governance
On-call maintenance of a service like presubmit for LLVM takes work, and some of that work will invariably be toil. I will do my best to minimize the toil during the stabilization period, but I think any plan for shared governance of infrastructure also needs to address the shared toil.
For equitable work sharing within Google, we typically establish an on-call rotation for system administration. I would advise that any party that wants to participate in the presubmit governance also join the rotation. The community would need to establish SLAs (which requires the analytics to be in place first) and an escalation process. These are the kinds of decisions that I’m hopeful we can rely on LLVM governance and the Foundation to help us navigate, once the system is ready to hand over.
Next Steps
Work is already underway on the stabilization front. If there are other urgent issues requiring immediate attention, please let me know. For now, I can be the point of contact for the Google-provided infrastructure. If you have feedback, questions, or ideas you’d like to share, please do get in touch.
Thanks!