This is an attempt to summarize my digital scribbles from the Presubmit Testing roundtable at the dev meeting on 11 October 2023.
The participants were obviously a selection of people especially interested in presubmit testing. It would be interesting to hear thoughts from the rest of the community too!
In Chromium, about 50% of the issues we hit when trying to update the toolchain are basic issues which could have been caught by better presubmit testing
Hans’s dream: Whenever someone uploads a patch for code review, the system would automatically tell them whether it will break the basic build or lit tests on Linux, Windows, or Mac
Should presubmit tests be blocking?
It’s generally easy to get people to agree to having presubmit testing that catches lit failures on the basic platforms
But it’s hard to get people to agree on blocking tests, as there are concerns about velocity etc.
There is a sense that blocking presubmit tests would not go over so well in LLVM
People at the table seemed to mostly favor a system where presubmit tests are mandatory (failures would block merging), but with a bypass mechanism
Step 1 might not be to make it mandatory, but that could be a goal
Not making it mandatory can give negative signal to the community: tests always being red due to failures on trunk, etc.
You have to take a stand and say that some checks are required: start with the smallest possible set so people buy in, then add to it from there.
You can always start with “does it compile”. Even that catches lots of issues.
If tests are mandatory, or even strongly advised, what does it mean if I bypass and get it wrong? Social stigma?
We’ve heard many people say extensive mandatory testing is good. Two folks said efficiency is a concern.
Is there an official LLVM policy to revert to green? Yes.
Then a presubmit failure is strictly better than a post-submit failure and revert to green
But green is ambiguous: what platforms, etc…
We have a lot of buildbots. What if all of those became the presubmit testing? What’s the concern? Cost?
It’s expensive and labor intensive to maintain
Also commonly agreed problems with the buildbots: they take a long time, and batch changes — you get an email hours or days later about a breakage and it’s not clear how it relates to your change.
Suppose the buildbots are divided into “official” and “non-official”, where the official ones are non-batching and made part of presubmit testing?
There’s a risk that everyone wants to add their tests. Where do we draw the line? Who owns that? Area team?
Phases of Development
Maybe testing should be different in different phases of development
GCC does phases where breaking stuff in the early ones is not so problematic, but then in later stages you’re not allowed to break anything
LLVM does stabilization on release branches
Maybe we should say that when we start looking at things going into the release, that’s when we tie things down?
We have the technology, this is not rocket science, but at what stage it’s done affects the velocity of development.
Supposedly GCC now has presubmit testing, implemented by Linaro (not sure where we heard this)
Release process notwithstanding, projects seem to identify a basic bar that they want to check before committing changes
It doesn’t have to be everything
We have developers who want to be able to get patches in quickly. People may urgently be waiting for something to get merged, to unblock them. Having to wait 2-3 hours will set people on fire.
There’s always overrides
Letting breaking changes into trunk can also slow velocity: while I’m working on fixing something, someone breaks something else and it’s harder for everyone to make progress
For feature work, it takes weeks to get a patch reviewed, so another six hours to get it merged may not be a big issue, so long as there is an override for emergencies.
Getting an email two days later saying that you broke something is very annoying. Would love an option to say “please submit my patch once you’ve run all the tests”. Don’t care what everyone else does, but it would save me trouble.
Related to the keynote on compassion in open source: Presubmit testing may cause my velocity to go down, but everyone else’s may go up since they are not blocked on red bots — helps the community, perhaps for a little expense of the individual
“One or two hours is long enough that I’d say absolutely not and go around it”. 10 minutes absolute maximum.
LLVM has strongly pushed people towards incremental workflow
We often split out unrelated changes: NFCs, pre-submitting test baselines
This needs a tight iteration loop
At the same time, those NFC one-liners are exactly what one would want presubmit testing for
Having a minimum test bar with a fast cycle, and an opt-in “test the heck out of this”, would make total sense
Experience from libc++
Presubmit testing on every PR
Very extensive, almost 60 jobs
All supported configs — having presubmit tests is a requirement for being a supported platform
Very costly, but invaluable for quality, they love it
Almost never have to revert changes
Better experience for contributors, especially new ones: don’t get their patches reverted
The main problem is bandwidth: the testing is slow
One solution is setting up a merge queue: only run a small set of tests on PR upload, then after review, press a button and let the merge queue run full tests before merging if they pass
There’s a github feature for this (it’s per-project, but maybe it could be set up to only run tests for libc++ changes for now)
Would greatly reduce strain on main CI, give faster turnaround, and maintain exhaustive testing on every PR
We like this so much, we think most libc++ devs would be happy to make the merge queue mandatory
Rolling this out saw little pushback from the libc++ community, but it was also very small at the time.
Some have heard that developers complain libc++ premerge testing is expensive and slow, though.
Merge queue should solve that.
One of the big benefits is having this configured in the main repository
People are eager to add coverage for new platforms: can just send patch
Not dependent on buildbot master restarts, config changes are just a patch
Experience from Swift
We’ve had presubmit testing for 9+ years in public
Testing is mandatory before merging anything
Build LLVM, Swift, and run all the tests
Do people complain? They complain about the speed, but see the benefits. And it’s not optional.
It was slow originally, but using caching sped it up.
The infrastructure is costly, but the benefits outweigh the engineering cost.
You have to add a comment to ask for testing
You can ask for smoke tests, that’s what’s required to merge
You can ask for full tests, that gives full coverage of each platform
You can ask for benchmarks, which prints out what you regressed on the PR itself
It’s all about the waterfall, giving people the ability to do the testing they think is right for their change
But there needs to be a minimum set of mandatory tests
Running on Jenkins backed by a cloud service provider
Make it easy to test
One benefit of presubmit testing is that it makes it easy for developers to test
Especially for platforms that they don’t have access to
Especially for newcomers
What we have today (outside of libc++)
If this is such an obviously good idea, why haven’t we done it yet? Some of us had this exact same conversation last year.
We mostly have it already: presubmit testing existed in Phabricator, and we moved it to GitHub PRs.
There’s some sense that it doesn’t work great though:
Often fails for unrelated reasons “red all the time”
There’s a huge backlog, takes like 24 hours to get a result (people had mixed experiences)
Can we get some data on how often it fails?
Testing is Linux only, or did we also have Windows? Definitely no Mac
Maybe there are two reasons:
Nobody has stepped up and said they’re going to make this great for their project (like libc++)
Machines. It’s hard to get enough machines.
We run self-hosted on a mix of machines, mostly very powerful.
Limited testing based on patch contents
Let’s say you touch the X86 backend, you’re probably not going to break ARM
Maybe then a smaller set of tests could be run?
Seems hard to implement with any accuracy in our current build
Also watch out for implicit interactions, e.g. a libc++ change can break lldb data formatters
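The path-based selection above can be sketched as a small lookup. This is purely illustrative: the prefix-to-suite map, the suite names, and the explicit edge for the libc++/lldb interaction are hypothetical, not how the actual `.ci` pipeline generation works.

```python
# Hypothetical sketch of path-based test selection. All path prefixes and
# suite names below are invented for illustration.

# Which source directory maps to which test suite.
PATH_TO_SUITE = {
    "llvm/lib/Target/X86/": "x86-tests",
    "llvm/lib/Target/ARM/": "arm-tests",
    "libcxx/": "libcxx-tests",
    "lldb/": "lldb-tests",
}

# Known cross-project interactions that a pure path mapping would miss,
# e.g. a libc++ change can break the lldb data formatters.
IMPLICIT_DEPENDENTS = {
    "libcxx-tests": {"lldb-tests"},
}

def select_tests(changed_files):
    """Return the set of test suites to run for a list of changed paths."""
    suites = set()
    for path in changed_files:
        for prefix, suite in PATH_TO_SUITE.items():
            if path.startswith(prefix):
                suites.add(suite)
    # Pull in suites of projects that implicitly depend on what changed.
    for suite in list(suites):
        suites |= IMPLICIT_DEPENDENTS.get(suite, set())
    return suites
```

The hard part, as noted above, is curating those implicit edges; the path mapping itself is the easy half.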
The fact that LLVM is inherently a cross-compiler is in large part tied to our testing
For example, RISC-V tests will mostly run on x86, because that’s where most of our tests run
If I could use presubmit checks running on RISC-V, maybe I only care about my tests running on RISC-V
If we move to a model where presubmit bots run a bunch of things on different platforms, the developer loses the value of writing platform-agnostic lit checks
The buildbots seem pretty unloved
libc++ thinks we should ditch buildbots and just go with what they have instead
Do we have statistics? It feels like finding a good LLVM revision was easier a few years ago. Teams could get weekly releases out, now they’re happy if they can do it monthly.
We don’t have organizational power to say no to a failing buildbot.
How to make progress
If we’re concerned about libc++ presubmit speed, maybe the first thing we should do is help make that better
If we get libc++ presubmit testing looking really good, we have a showcase, and then other projects can start adopting the same system
To start incrementally, maybe we should start with the runtimes?
Put this through the RFC process? Mandatory tests seems like a big change.
Oh, I missed that there was a round-table on this topic, it’s one that is super important! Thanks for taking the time to write notes and organize it @hansw2000: highly valuable.
This is what I call the “fire and forget” mode: I don’t mind that my commit does not land immediately and will take 2h. However I care about the overhead for myself:
“Fire and forget” means I don’t get more overhead landing the commit, assuming the test passes.
I only have work to do if there is a failure: but if there was a post-merge failure I would also have to do some work, so this isn’t overhead! (On the contrary, actually: it removes the urgency to fix.)
Caveat: this requires a high-confidence that a CI failure means that “my patch is broken” (and not: flaky test, infra failure, trunk broken, etc.).
→ I need actual zero overhead for my NFC changes (and am willing to wait 2h for these to land).
One thing that I haven’t seen addressed here is reverts: a post-merge buildbot is broken and I need to revert; can I push directly? Are we delaying getting back to “green”? The revert could also fix the bot but break the premerge config for another reason
“mid-air collision”: unless you serialize everything (is it realistic?), trunk will break because of conflicting changes landing at the same time: do we block everyone’s contribution until trunk is fixed?
Testing time seems quite OK actually; it likely depends on what you’re working on. Here is a run on a recent PR: GitHub pull request #8565
Because it touches MLIR, it’ll also run the flang tests, but it won’t run the clang or LLVM tests (just build what’s needed as dependencies).
Linux build/test: “Ran in 7m 34s”, “Waited 8m 47s”
Windows build/test: “Ran in 24m 34s”, “Waited 45m 55s”
I actually like the service buildbots provide, in particular I don’t quite see what the alternative should be?
But regardless, we should discuss requirements, that’s the best way to figure out what actually solves the problem!
In particular I believe that a post-merge system has different constraints than just “running the build and reporting a red/green signal” which can be enough for pre-merge testing.
For example post-merge systems have to be able to decide when to batch commits when a bot can’t keep up. They need to keep track of the history, if only to build a blame list. Then there is a need for notification of the blamelist (and build cop) when something fails. A post-merge bot can also do incremental build differently.
Investigating failures also puts pressure on dashboarding: when did this bot start failing? Was this bot flaky? For a given change, where has it been tested? (This last point is not straightforward when you include commit batching! GitHub build statuses don’t help AFAIK…)
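One concrete consequence of the batching mentioned above is that a red batched build only gives you a blamelist, not a culprit. Narrowing it down is a binary search over prefixes of the batch; a sketch, where `build_is_green` is a hypothetical predicate standing in for "build trunk with this prefix of the batch applied":

```python
def find_culprit(blamelist, build_is_green):
    """Binary-search a batched blamelist for the breaking commit.

    Assumes the build was green before the batch and that exactly one
    commit in the batch broke it (no flakiness). `build_is_green(commits)`
    is a hypothetical predicate: does trunk pass with the given prefix of
    the batch applied?
    """
    lo, hi = 0, len(blamelist)  # invariant: culprit index is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_is_green(blamelist[:mid]):
            lo = mid  # first `mid` commits are fine, culprit is later
        else:
            hi = mid  # culprit is within the first `mid` commits
    return blamelist[lo]
```

Each probe is a full build, which is exactly why batching trades machine time for developer time: a batch of n commits may need up to log2(n) extra builds to assign blame.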
Oh, and for roll-out: my take is that if you have a “fire and forget” system that provides enough value, people will naturally use it and the transition to “mandatory” becomes easier.
Over time, the few people who “bypass” and break the CI will get peer pressure (“please run the CI next time” comment on their reverted PR…).
Step 1 is “proving” that a robust system is possible (and not for a small isolated component, rather for at least LLVM+Clang+CompilerRT), and actually can be “zero overhead”.
Going from “opt-in” to “mandatory blocking” should be a non-controversial decision based on existing practices, we have yet to get there IMO.
I think the first thing we need is some community compute resources that anyone can easily use. GitHub Actions has this, and just in the last 2 months we already have a few pre-commit jobs that check code formatting and test build the documentation.
However, the free tier GitHub Actions machines are not really powerful enough to implement the kind of CI we are talking about here, so only people with access to large compute resources are able to experiment and develop the more resource intensive CI systems that we need. I really would like to change this so that anyone can contribute to a new CI system.
I have been looking recently at how to set up a cluster of self-hosted GitHub action machines. I found this blog post which describes a very efficient way to manage the self-hosted runners. If anyone has experience with kubernetes, terraform, or any of the cloud services and is interested in working on this, please let me know and we can try to get something up and running.
We talked about setting this sort of thing up a little bit at EuroLLVM last year (details start around RFC: Prototyping Pre-commit Testing Using Buildbot - #27 by kwk). One of the main things that was brought up was the cost of running pre-commit CI. I’m not sure who’s interested in footing the bill for this sort of thing.
Also, the pre-merge CI right now does somewhat work. I know there’s been some interest in moving the infrastructure over to GitHub Actions, but not sure what the progress has been like there. I think that would be a great opportunity to consolidate everything (rather than having stuff in some google/ GitHub repo) while also making it easier for third parties to hack on. Converting that over would also somewhat solve the problem of who’s paying for the compute and would prevent duplication. I’m not too familiar with the cloud setup that they use currently. I think it might even be Kubernetes on top of GCP, but not entirely sure.
I have some experience with self-hosted Kubernetes, but none with Terraform/OpenTofu on any of the major cloud providers. I’d be interested in contributing to these efforts once we have consensus on how we want to set everything up, but I probably won’t have sufficient amounts of time to really get things working until I’m out for winter break in December. Can probably contribute sporadically between now and then though.
Seems to me that while the question is interesting, the answer does not directly apply to a CI environment where it’s all about the caching infrastructure: when a large part of the build is cached, the actual time is dominated by running the tests.
Edit: for example, one of the MLIR buildbots (a GCP VM with 16 cores and 32GB) does a clean build but uses ccache:
32s to run cmake
37s for the build (thanks ccache)
1m 16s for the tests (and the MLIR tests are much, much smaller than LLVM’s).
There’s integration with GitHub Actions (job statuses get reported in GitHub Actions), but from what I can tell all the pipeline definitions and general infrastructure are managed through Buildkite. https://github.com/llvm/llvm-premerge-checks is still actively maintained and the premerge checks link to Buildkite from within GitHub. @goncharov can probably provide more info on that though.
Does anyone know about how many cores are needed to build llvm+clang+compiler-rt in 10 minutes?
A 32-core virtual machine does this just fine as a daily driver.
From a large branch rebase (basically a clean build, check target):
From an incremental rebuild/test (check-llvm target) on the same commit:
The latter may seem unfair as it’s incremental, but then, why is it unreasonable for CI to be incremental? And this case is the normal one for iterative development: add a new test, run check, submit, rebase, patch, run check, update review.
Hi Tom, as I replied somewhere already - I am down to do that. But probably after moving from Phabricator completely - then we will be able to move VMs from serving buildkite to self hosted GitHub runners.
https://github.com/llvm/llvm-premerge-checks is still here but it’s not orchestrating anything really. It only contains the code for the Docker images and a small server for Phabricator integration. All the “interesting” parts are now in the llvm-project/.ci folder that generates the pipelines.
On this note: I don’t see much interest in contributing to CI even after it got moved to the llvm-project sources. Probably it’s because there is no good documentation on how it works and how to modify / test things. What would be a good place for it? Just a README.md in this folder? Maybe something in the official docs on the web?
10 minutes for a basic build is really hard to get; I think it’s only reasonable to expect this from Linux builds. Windows might be able to complete the build, but certainly not the tests.
I wonder if there is a way to have an “optional commit queue” - so I can hand my CL off for submission if I am not in a hurry, but otherwise break glass and submit as today. I think that introducing even small bumps for the blind NFC commits to main will have a positive impact on how people land changes - e.g. today I submit straight to main because it’s just so easy. One step forward here would be to only allow commits that have a PR (except bots, of course).
That’s a great idea, I don’t recall discussing this before, and I feel like it could be a clear path forward to get where we want to go that creates value for the community along the way.
I think empowering the community is broadly good, but I want to push back a little on the idea that we as a project are going to liberally allow folks to add new resource intensive premerge checks. We are currently undergoing a period of rapid development with the new PR system and I don’t want to slow that down by imposing new process, but in the long run, as a community, we need to maintain a high quality bar for new premerge checks.
One of the questions I tried to ask at the roundtable and would like to get a clear answer for from the community is, what is our latency tolerance for premerge testing in a fire & forget system? Suppose we have a command, like “@llvmbot /testandmerge”, what should our latency service level objective (SLO) be, in minutes? That budget should guide our curation of tests, build configurations, compute resource allocation, etc.
I want to emphasize that there is no right answer to this question, everyone I ask gives me a different answer. Some folks want rapid iteration, and some folks want extremely reliable landing, with little to no risk of reverts and follow-on communication of issues after landing. I think the best way to get a clear answer is probably to start a new thread with a survey using buckets for time. So, let’s table this question for now, and look forward to a new thread in the future.
I don’t think we should let anyone add any premerge checks that they want either. I just want to make sure there are resources available so anyone can design and prototype a CI system. Without community resources, the number of people who can contribute to this is very limited.
I’ve seen studios take exactly this approach. There’s an optional build queue that will run build and CI before merging. If it fails, it gets bounced back to the submitter to try again. People who go direct and break the build will take heat for it. Adoption tends to go up because it helps keep things (at least the main things) green.
Independently (IMO) it would be nice to have a way to request a specific bot to build/test my fork, so I can diagnose/repair the failure after someone reverts my (non-CI-breaking but other-config-breaking) commit. But that’s separate from presubmit testing.
The build queueing system I’ve seen does indeed serialize everything. The build might be parallelized but the queue is not (this is inherently how git works). If two people queue conflicting changes, the first one wins and the loser needs to do the accommodation.
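That serialized queue (and the merge-queue idea from the libc++ section) can be modeled as a simple loop. This is a toy sketch, not any real system: `run_tests` is a hypothetical stand-in for the full presubmit run.

```python
def process_queue(head, queue, run_tests):
    """Toy model of a serialized merge queue.

    `head` is the current trunk state (modeled as a list of landed
    changes), `queue` is the ordered list of pending changes, and
    `run_tests(head, change)` is a hypothetical predicate: does trunk
    still pass with `change` applied on top of `head`? Changes are
    tested one at a time against the latest head; failures are bounced
    back to their authors instead of landing.
    """
    bounced = []
    for change in queue:
        if run_tests(head, change):
            head = head + [change]  # change lands, head advances
        else:
            bounced.append(change)  # loser fixes up and resubmits
    return head, bounced
```

Because each change is tested against the head that includes everything queued before it, trunk can only go red through infrastructure or flakiness, never through the "mid-air collision" of two individually-green changes; the price is that queue throughput is bounded by one test cycle per change unless you add speculation.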